Context¶

Post Graduate Program in Data Science and Business Analytics - Supervised Learning: Linear Regression of a fictious company, Cars4U, and their Sales Data.

Project¶

From GreatLearning:

There is a huge demand for used cars in the Indian Market today. As sales of new cars have slowed down in the recent past, the pre-owned car market has continued to grow over the past years and is larger than the new car market now. Cars4U is a budding tech start-up that aims to find footholds in this market.

In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. There is a slowdown in new car sales and that could mean that the demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturer / except for dealership level discounts which come into play only in the last stage of the customer journey), used cars are very different beasts with huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme of these used cars becomes important in order to grow in the market.

As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.

Project Assumptions¶

  1. Cars4U is a tech-start-up that is also a business that sells used cars.
  2. The past car price is a good predictor of future car price.
  3. A linear regression model can accurately predict the future car price by training and testing on sample data.
  4. The data is only a sample and may have factors which are not casual to the price variance in the population but we assume that it is a representative sample that can be used to draw inferences and make predictions.

Project Objectives¶

  1. Explore and visualize the dataset.

  2. Build a linear regression model to predict the prices of used cars.

  3. Generate a set of insights and recommendations that will help the business.

Problems & Questions¶

The business does not know the market price of every used vechicle and the future of those prices and so it cannot come up with a pricing model that can effectively predict the price and therefore give good suggestions to customers on how much to spend.

The problem is also related to the Indian Market. Other influences from exterior markets will be ignored. This is an experiment. The background of the Indian car market would be good to know in this case.

The dataset aims to answer the following key questions:

  • What are the predictive variables actually affecting used car price?

  • What can we avoid looking at to save time?

  • What do these relationships tell us about how to consult the selling agent on how to price their used cars?

Intuitive Hypothesis¶

At a first glance of the dataset, I can say that I think that the most predictive variables on price will be Year, Name, and Kilometers_Driven. This is generally thought of as the questions to ask about a car: What make and model? What year or how only is it? How many miles does it have on it? These usually give us a ball-park estimate of the cars value.

Data Dictionary¶

used_cars_data.csv - given by Post Graduate Program in Data Science and Business Analytics. Contains information about fictious cars.

S.No. : Serial Number.

Name : Name of the car which includes Brand name and Model name.

Location : The location in which the car is being sold or is available for purchase (Cities).

Year : Manufacturing year of the car.

Kilometers_driven : The total kilometers driven in the car by the previous owner(s) in KM.

Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG).

Transmission : The type of transmission used by the car. (Automatic / Manual).

Owner : Type of ownership.

Mileage : The standard mileage offered by the car company in kmpl or km/kg or number of miles/kilometers a car can travel in one liter/kilogram of fuel.

Engine : The displacement volume of the engine in CC.

Power : The maximum power of the engine in bhp. From GreatLearning Expert:"BHP or Brake Horse Power is the unit of power of an engine which determines the acceleration and top speed of a vehicle. Higher the BHP higher the power of an engine."

Seats : The number of seats in the car.

New_Price : The price of a new car of the same model in INR Lakhs.(1 Lakh = $100, 000 INR) If it's lower than the used price then not worth it. *the use the ($) sign to denote monetary units and not USD.

Price : The price of the used car in INR Lakhs (1 Lakh = $100, 000 INR)


Table of Contents: Analysis Steps¶


1. Data Collection

2. Data Pre-processing

3. Exploratory Data Analysis

4. Model building - Linear Regression

5. Statistics

6. Best Model Predicting

7. Inferences

8. Insights and Recommendations

9. Project Improvement

Data Collection¶

Import Modules and Set Rules¶

In [1]:
# Install needed libraries
#!pip3 install pandas_profiling
!pip install scipy==1.6.1
!pip install jupyter_nbextensions_configurator
%load_ext nb_black

# Warning packages
import warnings

warnings.filterwarnings("ignore")

# import Python packages
import itertools
from itertools import cycle, chain

# Data analysis packages
import pylab
import pandas as pd
from pandas_profiling import ProfileReport
import numpy as np

# Data visualization packages
import seaborn as sns
import matplotlib.pyplot as plt

# To enable plotting graphs in Jupyter notebook
%matplotlib inline
plt.rcParams["figure.figsize"] = [12, 8]  # set default figure size to 10" x 5"

# for statistical packages
from sklearn.preprocessing import RobustScaler
from sklearn import linear_model  # for working with polynomial processing
from sklearn.linear_model import LinearRegression  # To build linear regression_model
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
from scipy import stats
import scipy.stats as stats
from scipy.stats import mode, norm, pearsonr, tmean  # for a Pearson Correlation matrix

# Sklearn package's randomized data splitting function
from sklearn.model_selection import train_test_split


# To check model performance
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

from sklearn.preprocessing import PolynomialFeatures  # for adding features to the model
from sklearn.preprocessing import (
    StandardScaler,
)  # for standard scaling of normalized data (assumed for Linear Regression model)
import pylab  # for making a probability plot

"""perform linear regression using *statsmodels*, 
a Python module that provides functions for the estimation of many statistical models, 
as well as for conducting statistical tests, and statistical data exploration.
"""

# limit or remove from the number of displayed columns and rows.
# This is so I can see the entire dataframe when I print it
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)
Requirement already satisfied: scipy==1.6.1 in c:\users\joe\anaconda3\lib\site-packages (1.6.1)
Requirement already satisfied: numpy>=1.16.5 in c:\users\joe\anaconda3\lib\site-packages (from scipy==1.6.1) (1.19.5)
Requirement already satisfied: jupyter_nbextensions_configurator in c:\users\joe\anaconda3\lib\site-packages (0.4.1)
Requirement already satisfied: jupyter-contrib-core>=0.3.3 in c:\users\joe\anaconda3\lib\site-packages (from jupyter_nbextensions_configurator) (0.3.3)
Requirement already satisfied: jupyter-core in c:\users\joe\appdata\roaming\python\python38\site-packages (from jupyter_nbextensions_configurator) (4.7.0)
Requirement already satisfied: traitlets in c:\users\joe\appdata\roaming\python\python38\site-packages (from jupyter_nbextensions_configurator) (5.0.5)
Requirement already satisfied: tornado in c:\users\joe\appdata\roaming\python\python38\site-packages (from jupyter_nbextensions_configurator) (6.1)
Requirement already satisfied: pyyaml in c:\users\joe\anaconda3\lib\site-packages (from jupyter_nbextensions_configurator) (5.4.1)
Requirement already satisfied: notebook>=4.0 in c:\users\joe\anaconda3\lib\site-packages (from jupyter_nbextensions_configurator) (6.3.0)
Requirement already satisfied: setuptools in c:\users\joe\anaconda3\lib\site-packages (from jupyter-contrib-core>=0.3.3->jupyter_nbextensions_configurator) (52.0.0.post20210125)
Requirement already satisfied: nbconvert in c:\users\joe\anaconda3\lib\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (6.0.7)
Requirement already satisfied: ipykernel in c:\users\joe\anaconda3\lib\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (5.3.4)
Requirement already satisfied: terminado>=0.8.3 in c:\users\joe\anaconda3\lib\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (0.9.4)
Requirement already satisfied: jinja2 in c:\users\joe\anaconda3\lib\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (2.11.3)
Requirement already satisfied: prometheus-client in c:\users\joe\anaconda3\lib\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (0.10.1)
Requirement already satisfied: nbformat in c:\users\joe\anaconda3\lib\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (5.1.3)
Requirement already satisfied: ipython-genutils in c:\users\joe\appdata\roaming\python\python38\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (0.2.0)
Requirement already satisfied: Send2Trash>=1.5.0 in c:\users\joe\anaconda3\lib\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (1.5.0)
Requirement already satisfied: pyzmq>=17 in c:\users\joe\anaconda3\lib\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (20.0.0)
Requirement already satisfied: jupyter-client>=5.3.4 in c:\users\joe\appdata\roaming\python\python38\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (6.1.7)
Requirement already satisfied: argon2-cffi in c:\users\joe\anaconda3\lib\site-packages (from notebook>=4.0->jupyter_nbextensions_configurator) (20.1.0)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\joe\anaconda3\lib\site-packages (from jupyter-client>=5.3.4->notebook>=4.0->jupyter_nbextensions_configurator) (2.8.1)
Requirement already satisfied: pywin32>=1.0 in c:\users\joe\anaconda3\lib\site-packages (from jupyter-core->jupyter_nbextensions_configurator) (227)
Requirement already satisfied: six>=1.5 in c:\users\joe\anaconda3\lib\site-packages (from python-dateutil>=2.1->jupyter-client>=5.3.4->notebook>=4.0->jupyter_nbextensions_configurator) (1.15.0)
Requirement already satisfied: pywinpty>=0.5 in c:\users\joe\anaconda3\lib\site-packages (from terminado>=0.8.3->notebook>=4.0->jupyter_nbextensions_configurator) (0.5.7)
Requirement already satisfied: cffi>=1.0.0 in c:\users\joe\anaconda3\lib\site-packages (from argon2-cffi->notebook>=4.0->jupyter_nbextensions_configurator) (1.14.5)
Requirement already satisfied: pycparser in c:\users\joe\anaconda3\lib\site-packages (from cffi>=1.0.0->argon2-cffi->notebook>=4.0->jupyter_nbextensions_configurator) (2.20)
Requirement already satisfied: ipython>=5.0.0 in c:\users\joe\anaconda3\lib\site-packages (from ipykernel->notebook>=4.0->jupyter_nbextensions_configurator) (7.22.0)
Requirement already satisfied: pygments in c:\users\joe\appdata\roaming\python\python38\site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.0->jupyter_nbextensions_configurator) (2.7.3)
Requirement already satisfied: jedi>=0.16 in c:\users\joe\appdata\roaming\python\python38\site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.0->jupyter_nbextensions_configurator) (0.17.2)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in c:\users\joe\appdata\roaming\python\python38\site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.0->jupyter_nbextensions_configurator) (3.0.8)
Requirement already satisfied: colorama in c:\users\joe\appdata\roaming\python\python38\site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.0->jupyter_nbextensions_configurator) (0.4.4)
Requirement already satisfied: backcall in c:\users\joe\appdata\roaming\python\python38\site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.0->jupyter_nbextensions_configurator) (0.2.0)
Requirement already satisfied: pickleshare in c:\users\joe\appdata\roaming\python\python38\site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.0->jupyter_nbextensions_configurator) (0.7.5)
Requirement already satisfied: decorator in c:\users\joe\appdata\roaming\python\python38\site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.0->jupyter_nbextensions_configurator) (4.4.2)
Requirement already satisfied: parso<0.8.0,>=0.7.0 in c:\users\joe\appdata\roaming\python\python38\site-packages (from jedi>=0.16->ipython>=5.0.0->ipykernel->notebook>=4.0->jupyter_nbextensions_configurator) (0.7.1)
Requirement already satisfied: wcwidth in c:\users\joe\appdata\roaming\python\python38\site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=5.0.0->ipykernel->notebook>=4.0->jupyter_nbextensions_configurator) (0.2.5)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\joe\anaconda3\lib\site-packages (from jinja2->notebook>=4.0->jupyter_nbextensions_configurator) (1.1.1)
Requirement already satisfied: nbclient<0.6.0,>=0.5.0 in c:\users\joe\anaconda3\lib\site-packages (from nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (0.5.3)
Requirement already satisfied: pandocfilters>=1.4.1 in c:\users\joe\anaconda3\lib\site-packages (from nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (1.4.3)
Requirement already satisfied: jupyterlab-pygments in c:\users\joe\anaconda3\lib\site-packages (from nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (0.1.2)
Requirement already satisfied: defusedxml in c:\users\joe\anaconda3\lib\site-packages (from nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (0.7.1)
Requirement already satisfied: mistune<2,>=0.8.1 in c:\users\joe\anaconda3\lib\site-packages (from nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (0.8.4)
Requirement already satisfied: entrypoints>=0.2.2 in c:\users\joe\anaconda3\lib\site-packages (from nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (0.3)
Requirement already satisfied: testpath in c:\users\joe\anaconda3\lib\site-packages (from nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (0.4.4)
Requirement already satisfied: bleach in c:\users\joe\anaconda3\lib\site-packages (from nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (3.3.0)
Requirement already satisfied: nest-asyncio in c:\users\joe\anaconda3\lib\site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (1.5.1)
Requirement already satisfied: async-generator in c:\users\joe\anaconda3\lib\site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (1.10)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in c:\users\joe\anaconda3\lib\site-packages (from nbformat->notebook>=4.0->jupyter_nbextensions_configurator) (3.2.0)
Requirement already satisfied: attrs>=17.4.0 in c:\users\joe\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat->notebook>=4.0->jupyter_nbextensions_configurator) (20.3.0)
Requirement already satisfied: pyrsistent>=0.14.0 in c:\users\joe\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat->notebook>=4.0->jupyter_nbextensions_configurator) (0.17.3)
Requirement already satisfied: webencodings in c:\users\joe\anaconda3\lib\site-packages (from bleach->nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (0.5.1)
Requirement already satisfied: packaging in c:\users\joe\anaconda3\lib\site-packages (from bleach->nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (20.9)
Requirement already satisfied: pyparsing>=2.0.2 in c:\users\joe\anaconda3\lib\site-packages (from packaging->bleach->nbconvert->notebook>=4.0->jupyter_nbextensions_configurator) (2.4.7)
In [2]:
# check scipy version for 1.6.1 for using attributes of testing functions
import scipy

scipy.__version__
Out[2]:
'1.6.1'

Import and View¶

S.No. of the vehicle is not giving us additional useful information about the used cars that we will need at this level of analysis (trend and prediction) and is acting effectively as an index, which we already have, so I will set it up as the Index in the read_csv.

In [3]:
# import dataset and make a dataframe
df = pd.read_csv("used_cars_data.csv", index_col="S.No.")

pd.set_option(
    "display.float_format", lambda x: "%.2f" % x
)  # To supress numerical display in scientific notations
In [4]:
# Create a copy of the dataframe
df = df.copy()

# View shape of the dataframe
print(f"The dataframe shape is {df.shape}")

# I'm now going to look at 10 random rows
# I'm setting the random seed via np.random.seed so that
# I get the same random results every time
np.random.seed(1)
df.sample(n=10)
The dataframe shape is (7253, 13)
Out[4]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Engine Power Seats New_Price Price
S.No.
2397 Ford EcoSport 1.5 Petrol Trend Kolkata 2016 21460 Petrol Manual First 17.0 kmpl 1497 CC 121.36 bhp 5.00 9.47 Lakh 6.00
3777 Maruti Wagon R VXI 1.2 Kochi 2015 49818 Petrol Manual First 21.5 kmpl 1197 CC 81.80 bhp 5.00 5.44 Lakh 4.11
4425 Ford Endeavour 4x2 XLT Hyderabad 2007 130000 Diesel Manual First 13.1 kmpl 2499 CC 141 bhp 7.00 NaN 6.00
3661 Mercedes-Benz E-Class E250 CDI Avantgrade Coimbatore 2016 39753 Diesel Automatic First 13.0 kmpl 2143 CC 201.1 bhp 5.00 NaN 35.28
4514 Hyundai Xcent 1.2 Kappa AT SX Option Kochi 2016 45560 Petrol Automatic First 16.9 kmpl 1197 CC 82 bhp 5.00 NaN 6.34
599 Toyota Innova Crysta 2.8 ZX AT Coimbatore 2019 40674 Diesel Automatic First 11.36 kmpl 2755 CC 171.5 bhp 7.00 28.05 Lakh 24.82
186 Mercedes-Benz E-Class E250 CDI Avantgrade Bangalore 2014 37382 Diesel Automatic First 13.0 kmpl 2143 CC 201.1 bhp 5.00 NaN 32.00
305 Audi A6 2011-2015 2.0 TDI Premium Plus Kochi 2014 61726 Diesel Automatic First 17.68 kmpl 1968 CC 174.33 bhp 5.00 NaN 20.77
4582 Hyundai i20 1.2 Magna Kolkata 2011 36000 Petrol Manual First 18.5 kmpl 1197 CC 80 bhp 5.00 NaN 2.50
5434 Honda WR-V Edge Edition i-VTEC S Kochi 2019 13913 Petrol Manual First 17.5 kmpl 1199 CC 88.7 bhp 5.00 9.36 Lakh 8.20

Observations:

  1. We have a lot of variablility in the Name column.
  2. The Mileage, Power & New_Price has numeric and string values that will need to be processed.
  3. New_Price has many missing values and we will need to handle this.

It looks lke we are looking at a lot of string values that are not good for modeling in a linear regression model.

In [5]:
# View top 5 rows of dataframe
df.head()
Out[5]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Engine Power Seats New_Price Price
S.No.
0 Maruti Wagon R LXI CNG Mumbai 2010 72000 CNG Manual First 26.6 km/kg 998 CC 58.16 bhp 5.00 NaN 1.75
1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 41000 Diesel Manual First 19.67 kmpl 1582 CC 126.2 bhp 5.00 NaN 12.50
2 Honda Jazz V Chennai 2011 46000 Petrol Manual First 18.2 kmpl 1199 CC 88.7 bhp 5.00 8.61 Lakh 4.50
3 Maruti Ertiga VDI Chennai 2012 87000 Diesel Manual First 20.77 kmpl 1248 CC 88.76 bhp 7.00 NaN 6.00
4 Audi A4 New 2.0 TDI Multitronic Coimbatore 2013 40670 Diesel Automatic Second 15.2 kmpl 1968 CC 140.8 bhp 5.00 NaN 17.74
In [6]:
# View bottom 5 rows of dataframe
df.tail()
Out[6]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Engine Power Seats New_Price Price
S.No.
7248 Volkswagen Vento Diesel Trendline Hyderabad 2011 89411 Diesel Manual First 20.54 kmpl 1598 CC 103.6 bhp 5.00 NaN NaN
7249 Volkswagen Polo GT TSI Mumbai 2015 59000 Petrol Automatic First 17.21 kmpl 1197 CC 103.6 bhp 5.00 NaN NaN
7250 Nissan Micra Diesel XV Kolkata 2012 28000 Diesel Manual First 23.08 kmpl 1461 CC 63.1 bhp 5.00 NaN NaN
7251 Volkswagen Polo GT TSI Pune 2013 52262 Petrol Automatic Third 17.2 kmpl 1197 CC 103.6 bhp 5.00 NaN NaN
7252 Mercedes-Benz E-Class 2009-2013 E 220 CDI Avan... Kochi 2014 72443 Diesel Automatic First 10.0 kmpl 2148 CC 170 bhp 5.00 NaN NaN

Observations:

Looks like there are both NaN and nan values present. This will need to get handled separately.

Data Pre-processing¶

  1. Data Format
    • Data dimensions
    • Dtypes
  2. Data Consistency
    • Sanity, Missing, Duplicates checking.
    • Distribution and skewness
    • Outliers
  3. Feature Engineering
    • Variable transformations (numerics, logs, exponentials..)
    • Feature extraction to meet model assumptions/requirements (make model inputs meaningful, significant coefficients, dropping highly correlated, space transformation....)

Data Format¶

Getting all the features into numeric form for modeling.

Data Dimensions¶

In [7]:
# get dataframe shape
df.shape
Out[7]:
(7253, 13)
In [8]:
# get dtypes
df.dtypes
Out[8]:
Name                  object
Location              object
Year                   int64
Kilometers_Driven      int64
Fuel_Type             object
Transmission          object
Owner_Type            object
Mileage               object
Engine                object
Power                 object
Seats                float64
New_Price             object
Price                float64
dtype: object
In [9]:
# Get the information on the df
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7253 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               7253 non-null   object 
 1   Location           7253 non-null   object 
 2   Year               7253 non-null   int64  
 3   Kilometers_Driven  7253 non-null   int64  
 4   Fuel_Type          7253 non-null   object 
 5   Transmission       7253 non-null   object 
 6   Owner_Type         7253 non-null   object 
 7   Mileage            7251 non-null   object 
 8   Engine             7207 non-null   object 
 9   Power              7207 non-null   object 
 10  Seats              7200 non-null   float64
 11  New_Price          1006 non-null   object 
 12  Price              6019 non-null   float64
dtypes: float64(2), int64(2), object(9)
memory usage: 793.3+ KB

Observations:

Many different dtypes that will need to be checked through and maybe changed. Looking at the samples above we can say that the S.No.,Kilometers_Driven,Engine, and Seats, is meant clearly to be an int64, while Name, Location``Fuel_Type,Owner_Type, Year, and Transmission, are string categoricals, and Mileage,Power, are floats. New_Price and Price since they are given a conversion rate in the description, I can decide to conver them to not. Given that the other units are metric, I will keep the dollar value to the Indian currency. This will leave them as dtype=float.

Insights:

  • The Transmission column has 2 unique values - "Manual" and "Automatic"
  • The Engine column has 150 unique values, i.e., the data is collected from 150 Engine sizes.
  • Year ranges from 1996 to 2019.
  • Average Power is 112.31 CC's.
In [10]:
# Get the dataframe columns list
df.columns
Out[10]:
Index(['Name', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type',
       'Transmission', 'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats',
       'New_Price', 'Price'],
      dtype='object')

Dtypes¶

In [11]:
cat_vars = [
    "Name",
    "Location",
    "Fuel_Type",
    "Transmission",
    "Owner_Type",
    # "Year",
]

for colname in cat_vars:
    df[colname] = df[colname].astype("category")

df.info()  # Explore dataframe information, check dtype, and nulls.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7253 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Name               7253 non-null   category
 1   Location           7253 non-null   category
 2   Year               7253 non-null   int64   
 3   Kilometers_Driven  7253 non-null   int64   
 4   Fuel_Type          7253 non-null   category
 5   Transmission       7253 non-null   category
 6   Owner_Type         7253 non-null   category
 7   Mileage            7251 non-null   object  
 8   Engine             7207 non-null   object  
 9   Power              7207 non-null   object  
 10  Seats              7200 non-null   float64 
 11  New_Price          1006 non-null   object  
 12  Price              6019 non-null   float64 
dtypes: category(5), float64(2), int64(2), object(4)
memory usage: 633.9+ KB

Observations:

  1. New_Price has many missing values and we will need to handle this.
  2. Price has a number of missing values and will need to be pre-processed in order to make it into our model.
  3. In all, not many missing values and a simple small data-set at 7k+ records.

Numerics

Before I can go on to do analysis I need to make sure the numeric types are set correctly by removing string values from them. These units of measurement are important for understanding the data so I will move them to the column headers to keep them top of mind when doing insights and recommendations.

In [12]:
# looking at value counts for non-categorical features

num_to_display = 10  # defining this up here so it's easy to change later if I want
for colname in df.dtypes[df.dtypes == "object"].index:  # call the mixed-dtype class
    val_counts = df[colname].value_counts(dropna=False)  # i want to see NA counts
    print(colname, "\n", "\n", val_counts[:num_to_display])
    if len(val_counts) > num_to_display:
        print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
    print("\n\n")  # just for more space
Mileage 
 
 17.0 kmpl     207
18.9 kmpl     201
18.6 kmpl     144
21.1 kmpl     106
20.36 kmpl    105
17.8 kmpl      98
18.0 kmpl      89
12.8 kmpl      87
18.5 kmpl      86
16.0 kmpl      85
Name: Mileage, dtype: int64
Only displaying first 10 of 451 values.



Engine 
 
 1197 CC    732
1248 CC    610
1498 CC    370
998 CC     309
1198 CC    281
2179 CC    278
1497 CC    273
1968 CC    266
1995 CC    212
1461 CC    188
Name: Engine, dtype: int64
Only displaying first 10 of 151 values.



Power 
 
 74 bhp       280
98.6 bhp     166
73.9 bhp     152
140 bhp      142
null bhp     129
78.9 bhp     128
67.1 bhp     126
67.04 bhp    125
82 bhp       124
88.5 bhp     120
Name: Power, dtype: int64
Only displaying first 10 of 387 values.



New_Price 
 
 NaN           6247
63.71 Lakh       6
33.36 Lakh       6
95.13 Lakh       6
4.78 Lakh        6
11.67 Lakh       5
20.74 Lakh       5
11.75 Lakh       5
4.98 Lakh        5
15.05 Lakh       5
Name: New_Price, dtype: int64
Only displaying first 10 of 626 values.



Observations:

  1. Looking at our non-numeric features we can see better that there are many numerics buried there that will need to be extracted and converted into numeric types by cleaning up the units of measurement out of them.

  2. Power has a value "null" which needs to be treated.

Power¶

Power is in bhp. We need to split the values into numeric and string so we will have two columns.

In [13]:
# change missing values in Power to actualy NaN values.
df["Power"] = df["Power"].replace("null bhp", np.nan)
"null bhp" in df.Power
Out[13]:
False
In [14]:
def floats(n):
    """In the Mileage and Power  columns I'm replacing the terminal 'km/kg','kmpl', 'bhp' and  with
    the empty string and converting to a float. Non-strings are
    np.nans and are kept as np.nans."""
    if isinstance(n, str):
        return float(n.replace(" bhp", ""))
    else:
        return np.nan
In [15]:
col_transforms = {"Power": floats}

# k is the key, so the column name here
# v is the value, which a function in this case and is
#     either `height_to_num` or `weight_to_num`
for k, v in col_transforms.items():
    df[k] = df[k].map(v)
In [16]:
# Look at the head
df.head()
Out[16]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Engine Power Seats New_Price Price
S.No.
0 Maruti Wagon R LXI CNG Mumbai 2010 72000 CNG Manual First 26.6 km/kg 998 CC 58.16 5.00 NaN 1.75
1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 41000 Diesel Manual First 19.67 kmpl 1582 CC 126.20 5.00 NaN 12.50
2 Honda Jazz V Chennai 2011 46000 Petrol Manual First 18.2 kmpl 1199 CC 88.70 5.00 8.61 Lakh 4.50
3 Maruti Ertiga VDI Chennai 2012 87000 Diesel Manual First 20.77 kmpl 1248 CC 88.76 7.00 NaN 6.00
4 Audi A4 New 2.0 TDI Multitronic Coimbatore 2013 40670 Diesel Automatic Second 15.2 kmpl 1968 CC 140.80 5.00 NaN 17.74

Verify the unit matches the fuel type

In [17]:
# check that function worked
for i in col_transforms:
    display(df[i].head())
S.No.
0    58.16
1   126.20
2    88.70
3    88.76
4   140.80
Name: Power, dtype: float64
In [18]:
# print dtypes
df.dtypes
Out[18]:
Name                 category
Location             category
Year                    int64
Kilometers_Driven       int64
Fuel_Type            category
Transmission         category
Owner_Type           category
Mileage                object
Engine                 object
Power                 float64
Seats                 float64
New_Price              object
Price                 float64
dtype: object

Engine¶

In [19]:
# Strip and create new column
df["Engine_num"] = (
    df["Engine"].apply(lambda x: x.replace("CC", "").strip() if type(x) == str else x)
).astype(float)
In [20]:
# check head
df.head()
Out[20]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Engine Power Seats New_Price Price Engine_num
S.No.
0 Maruti Wagon R LXI CNG Mumbai 2010 72000 CNG Manual First 26.6 km/kg 998 CC 58.16 5.00 NaN 1.75 998.00
1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 41000 Diesel Manual First 19.67 kmpl 1582 CC 126.20 5.00 NaN 12.50 1582.00
2 Honda Jazz V Chennai 2011 46000 Petrol Manual First 18.2 kmpl 1199 CC 88.70 5.00 8.61 Lakh 4.50 1199.00
3 Maruti Ertiga VDI Chennai 2012 87000 Diesel Manual First 20.77 kmpl 1248 CC 88.76 7.00 NaN 6.00 1248.00
4 Audi A4 New 2.0 TDI Multitronic Coimbatore 2013 40670 Diesel Automatic Second 15.2 kmpl 1968 CC 140.80 5.00 NaN 17.74 1968.00

We don't need the Engine column anymore since we only want the numbers.

In [21]:
# drop column
df.drop(columns="Engine", inplace=True, axis=1)

Engine_num and Seats are a float that is really a Int64 dtype (real number vs a float). First handle the missingness in each to allow conversion. We will use mean replacement.

In [22]:
numeric_columns = ["Engine_num"]

# for every item in the list print the mean
for i in numeric_columns:
    print(i, df[i].mean())

medianFiller = lambda x: x.fillna(
    x.median()
)  # use the median of the column to fill in na values

df[numeric_columns] = df[numeric_columns].apply(
    medianFiller, axis=0
)  # apply the median filling function

for i in numeric_columns:
    print("After filling in medians", i, df[i].mean())  # print the post-function mean
Engine_num 1616.5734702372692
After filling in medians Engine_num 1615.7897421756516

It has only changed by ~1 unit which is enough to be used.

In [23]:
# check for na values
df["Engine_num"].isna().sum()
Out[23]:
0
In [24]:
# dtype conversion
# df[["Engine_num","Seats"]].convert_dtypes(True,False,True,False,False)

df["Engine_num"] = df["Engine_num"].astype("int64", copy=False)
In [25]:
# check dtypes
df.dtypes
Out[25]:
Name                 category
Location             category
Year                    int64
Kilometers_Driven       int64
Fuel_Type            category
Transmission         category
Owner_Type           category
Mileage                object
Power                 float64
Seats                 float64
New_Price              object
Price                 float64
Engine_num              int64
dtype: object

It looks that Engine_num is correct. Now we need to look at Seats.

Seats¶

In [26]:
# Get missing
df.Seats.isna().sum()
Out[26]:
53

After checking for missing lets treat these values with standard mean replacement.

In [27]:
numeric_columns = ["Seats"]

# for every item in the list print the mean
for i in numeric_columns:
    print(i, df[i].mean())

medianFiller = lambda x: x.fillna(
    x.median()
)  # use the median of the column to fill in na values

df[numeric_columns] = df[numeric_columns].apply(
    medianFiller, axis=0
)  # apply the median filling function

for i in numeric_columns:
    print("After filling in medians", i, df[i].mean())  # print the post-function mean
Seats 5.279722222222222
After filling in medians Seats 5.277678202123259

Hardly changed the mean. We will use.

In [28]:
# check dtype
df.Seats = df.Seats.astype("int64", copy=False)
df.Seats.dtype
Out[28]:
dtype('int64')

That has been fixed. Now we need to continue with the parsing of object columns.

Mileage¶

For Mileage we need to get the units into another column and numerics into another still. kmpl - kilometers per litre - is used for petrol and diesel cars. km/kg - kilometers per kg - is used for CNG and LPG based engines. We can check this in our data by comparing it to the Fuel_Type variable.

In [29]:
# split the name on number vs string
df["Mileage_num"] = (
    df["Mileage"].apply(
        lambda x: x.replace(" kmpl", "").replace(" km/kg", "").strip()
        if type(x) == str
        else x
    )
).astype(float)
In [30]:
# check the df
df.head()
Out[30]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Power Seats New_Price Price Engine_num Mileage_num
S.No.
0 Maruti Wagon R LXI CNG Mumbai 2010 72000 CNG Manual First 26.6 km/kg 58.16 5 NaN 1.75 998 26.60
1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 41000 Diesel Manual First 19.67 kmpl 126.20 5 NaN 12.50 1582 19.67
2 Honda Jazz V Chennai 2011 46000 Petrol Manual First 18.2 kmpl 88.70 5 8.61 Lakh 4.50 1199 18.20
3 Maruti Ertiga VDI Chennai 2012 87000 Diesel Manual First 20.77 kmpl 88.76 7 NaN 6.00 1248 20.77
4 Audi A4 New 2.0 TDI Multitronic Coimbatore 2013 40670 Diesel Automatic Second 15.2 kmpl 140.80 5 NaN 17.74 1968 15.20
In [31]:
# extract text
df["Mileage_unit"] = (
    df["Mileage"].str.extract("([a-zA-Z ]+)", expand=False).str.strip()
).astype("category")
df.head()
Out[31]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Power Seats New_Price Price Engine_num Mileage_num Mileage_unit
S.No.
0 Maruti Wagon R LXI CNG Mumbai 2010 72000 CNG Manual First 26.6 km/kg 58.16 5 NaN 1.75 998 26.60 km
1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 41000 Diesel Manual First 19.67 kmpl 126.20 5 NaN 12.50 1582 19.67 kmpl
2 Honda Jazz V Chennai 2011 46000 Petrol Manual First 18.2 kmpl 88.70 5 8.61 Lakh 4.50 1199 18.20 kmpl
3 Maruti Ertiga VDI Chennai 2012 87000 Diesel Manual First 20.77 kmpl 88.76 7 NaN 6.00 1248 20.77 kmpl
4 Audi A4 New 2.0 TDI Multitronic Coimbatore 2013 40670 Diesel Automatic Second 15.2 kmpl 140.80 5 NaN 17.74 1968 15.20 kmpl
In [32]:
# Let us check if the units correspond to the fuel types as expected.
df.groupby(by=["Fuel_Type", "Mileage_unit"]).size()
Out[32]:
Fuel_Type  Mileage_unit
CNG        km                62
           kmpl               0
Diesel     km                 0
           kmpl            3852
Electric   km                 0
           kmpl               0
LPG        km                12
           kmpl               0
Petrol     km                 0
           kmpl            3325
dtype: int64

As expected, km/kg is for CNG/LPG cars and kmpl is for Petrol and Diesel cars.

Now that Mileage is separated we can drop the original column and the unit column knowing how to pair them to Fuel_Type.

In [33]:
# drop column
df.drop(labels=["Mileage", "Mileage_unit"], inplace=True, axis=1)
In [34]:
# check columns
df.columns
Out[34]:
Index(['Name', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type',
       'Transmission', 'Owner_Type', 'Power', 'Seats', 'New_Price', 'Price',
       'Engine_num', 'Mileage_num'],
      dtype='object')
In [35]:
# check conversion
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7253 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Name               7253 non-null   category
 1   Location           7253 non-null   category
 2   Year               7253 non-null   int64   
 3   Kilometers_Driven  7253 non-null   int64   
 4   Fuel_Type          7253 non-null   category
 5   Transmission       7253 non-null   category
 6   Owner_Type         7253 non-null   category
 7   Power              7078 non-null   float64 
 8   Seats              7253 non-null   int64   
 9   New_Price          1006 non-null   object  
 10  Price              6019 non-null   float64 
 11  Engine_num         7253 non-null   int64   
 12  Mileage_num        7251 non-null   float64 
dtypes: category(5), float64(3), int64(4), object(1)
memory usage: 891.9+ KB

Data Consistency¶

Sanity¶

Remove unneeded car types: "Electric" is a new market and will sway the pricing.

In [36]:
df.Fuel_Type.value_counts()
Out[36]:
Diesel      3852
Petrol      3325
CNG           62
LPG           12
Electric       2
Name: Fuel_Type, dtype: int64
In [37]:
# subset columns
elec = df.loc[df.Fuel_Type == "Electric"]
display(elec)
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats New_Price Price Engine_num Mileage_num
S.No.
4446 Mahindra E Verito D4 Chennai 2016 50000 Electric Automatic First 41.00 5 13.58 Lakh 13.00 72 NaN
4904 Toyota Prius 2009-2016 Z4 Mumbai 2011 44000 Electric Automatic First 73.00 5 NaN 12.75 1798 NaN
In [38]:
# drop rows
df.drop(labels=[4446, 4904], axis=0, inplace=True)
In [39]:
# check drop
df.loc[df.Fuel_Type == "Electric"]
Out[39]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats New_Price Price Engine_num Mileage_num
S.No.
In [40]:
# check the drop
df.Fuel_Type.value_counts().to_frame()
Out[40]:
Fuel_Type
Diesel 3852
Petrol 3325
CNG 62
LPG 12
Electric 0
In [41]:
# update the Catgorical dtype of Fuel_Type (https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)
# Values which are removed are replaced by np.nan.
display(df.Fuel_Type.cat.categories)
df.Fuel_Type = df.Fuel_Type.cat.remove_categories(["Electric"])
display(df.Fuel_Type.cat.categories)
Index(['CNG', 'Diesel', 'Electric', 'LPG', 'Petrol'], dtype='object')
Index(['CNG', 'Diesel', 'LPG', 'Petrol'], dtype='object')

Mileage may have odd values in it. Let's check

Extreme values in Mileage

In [42]:
# check the minimum
df.Mileage_num.min()
Out[42]:
0.0
In [43]:
# get subset
mil = df.loc[df.Mileage_num == df.Mileage_num.min()]
display(mil)
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats New_Price Price Engine_num Mileage_num
S.No.
14 Land Rover Freelander 2 TD4 SE Pune 2012 85000 Diesel Automatic Second 115.00 5 NaN 17.50 2179 0.00
67 Mercedes-Benz C-Class Progressive C 220d Coimbatore 2019 15369 Diesel Automatic First 194.00 5 49.14 Lakh 35.67 1950 0.00
79 Hyundai Santro Xing XL Hyderabad 2005 87591 Petrol Manual First NaN 5 NaN 1.30 1086 0.00
194 Honda City 1.5 GXI Ahmedabad 2007 60006 Petrol Manual First NaN 5 NaN 2.95 1493 0.00
229 Ford Figo Diesel Bangalore 2015 70436 Diesel Manual First 99.00 5 NaN 3.60 1498 0.00
262 Hyundai Santro Xing XL Hyderabad 2006 99000 Petrol Manual First NaN 5 NaN 1.75 1086 0.00
307 Hyundai Santro Xing XL Chennai 2006 58000 Petrol Manual Second NaN 5 NaN 1.50 1086 0.00
424 Volkswagen Jetta 2007-2011 1.9 L TDI Hyderabad 2010 42021 Diesel Manual First NaN 5 NaN 5.45 1968 0.00
443 Hyundai Santro GLS I - Euro I Coimbatore 2012 50243 Petrol Manual First NaN 5 NaN 3.35 1086 0.00
544 Mercedes-Benz New C-Class Progressive C 200 Kochi 2019 13190 Petrol Automatic First 181.43 5 49.49 Lakh 38.99 1950 0.00
631 Hyundai Santro LS zipPlus Chennai 2002 70000 Petrol Manual Third NaN 5 NaN 0.65 1086 0.00
647 Hyundai Santro Xing XP Jaipur 2004 200000 Petrol Manual First NaN 5 NaN 0.80 1086 0.00
707 Mercedes-Benz M-Class ML 350 4Matic Pune 2014 120000 Diesel Automatic First 165.00 5 NaN 30.00 2987 0.00
749 Land Rover Range Rover 3.0 D Mumbai 2008 55001 Diesel Automatic Second NaN 5 NaN 26.50 1493 0.00
915 Smart Fortwo CDI AT Pune 2008 103000 Diesel Automatic Second NaN 2 NaN 3.00 799 0.00
962 Mercedes-Benz C-Class Progressive C 220d Mumbai 2018 8682 Diesel Automatic First 194.00 5 52.26 Lakh 39.50 1950 0.00
996 Hyundai Santro Xing GL Pune 2008 93000 Petrol Manual First 62.00 5 NaN 1.45 1086 0.00
1059 Hyundai Santro Xing GL Hyderabad 2010 58163 Petrol Manual First 62.00 5 NaN 2.45 1086 0.00
1259 Land Rover Freelander 2 TD4 S Bangalore 2010 125000 Diesel Automatic Second 115.00 5 NaN 11.00 2179 0.00
1271 Hyundai Santro GLS I - Euro II Jaipur 2009 89000 Petrol Manual Second NaN 5 NaN 1.60 999 0.00
1308 Mercedes-Benz M-Class ML 350 4Matic Bangalore 2014 33000 Diesel Automatic Second 165.00 5 NaN 43.00 2987 0.00
1345 Maruti Baleno Vxi Pune 2005 70000 Petrol Manual First NaN 5 NaN 1.30 1590 0.00
1354 Hyundai Santro Xing GL Kochi 2011 20842 Petrol Manual First 62.00 5 NaN 2.78 1086 0.00
1385 Honda City 1.5 GXI Pune 2004 115000 Petrol Manual Second NaN 5 NaN 1.50 1493 0.00
1419 Hyundai Santro Xing XL Chennai 2007 82000 Petrol Manual Second NaN 5 NaN 1.35 1086 0.00
1460 Land Rover Range Rover Sport 2005 2012 Sport Coimbatore 2008 69078 Petrol Manual First NaN 5 NaN 40.88 1493 0.00
1764 Mercedes-Benz M-Class ML 350 4Matic Pune 2015 69000 Diesel Automatic First 165.00 5 NaN 38.00 2987 0.00
1857 Hyundai Santro DX Hyderabad 2007 96000 Petrol Manual Second NaN 5 NaN 2.20 999 0.00
2053 Mahindra Jeep MM 550 PE Hyderabad 2009 26000 Diesel Manual First NaN 6 NaN 6.99 2498 0.00
2096 Hyundai Santro LP zipPlus Coimbatore 2004 52146 Petrol Manual First NaN 5 NaN 1.93 1493 0.00
2130 Hyundai Santro GLS I - Euro II Coimbatore 2012 51019 Petrol Manual First NaN 5 NaN 3.48 999 0.00
2267 Toyota Qualis RS E2 Pune 2004 215750 Diesel Manual Second NaN 10 NaN 3.50 2446 0.00
2343 Hyundai Santro AT Hyderabad 2006 74483 Petrol Automatic First NaN 5 NaN 2.30 999 0.00
2542 Hyundai Santro GLS II - Euro II Bangalore 2011 65000 Petrol Manual Second NaN 5 NaN 3.15 1493 0.00
2597 Hyundai Santro Xing XP Pune 2007 70000 Petrol Manual First NaN 5 NaN 1.12 1086 0.00
2681 Skoda Superb 3.6 V6 FSI Hyderabad 2010 54000 Petrol Automatic First 262.60 5 NaN 6.00 3597 0.00
2780 Hyundai Santro GLS II - Euro II Pune 2009 100000 Petrol Manual First NaN 5 NaN 1.60 1493 0.00
2842 Hyundai Santro GLS II - Euro II Bangalore 2012 43000 Petrol Manual First NaN 5 NaN 3.25 1493 0.00
3033 Hyundai Santro Xing XP Jaipur 2005 120000 Petrol Manual First NaN 5 NaN 1.15 1086 0.00
3044 Hyundai Santro Xing GL Kolkata 2009 60170 Petrol Manual First 62.00 5 NaN 1.15 1086 0.00
3061 Hyundai Santro GS Ahmedabad 2005 58000 Petrol Manual Second NaN 5 NaN 1.51 999 0.00
3093 Audi A7 2011-2015 Sportback Kolkata 2012 24720 Diesel Automatic Second 241.40 5 NaN 28.00 2967 0.00
3189 Hyundai Santro GS zipDrive - Euro II Chennai 2002 67000 Petrol Manual Third NaN 5 NaN 1.20 999 0.00
3210 Mercedes-Benz M-Class ML 350 4Matic Coimbatore 2016 22769 Diesel Automatic First 165.00 5 NaN 49.22 2987 0.00
3271 Hyundai Santro Xing GL Bangalore 2008 35268 Petrol Manual Second 62.00 5 NaN 1.88 1086 0.00
3516 Hyundai Santro GLS I - Euro I Pune 2011 65400 Petrol Manual First NaN 5 NaN 2.10 1086 0.00
3522 Hyundai Santro GLS II - Euro II Kochi 2012 66400 Petrol Manual First NaN 5 NaN 2.66 1493 0.00
3645 Hyundai Santro Xing XP Bangalore 2004 167000 Petrol Manual First NaN 5 NaN 1.35 1086 0.00
4152 Land Rover Range Rover 3.0 D Mumbai 2003 75000 Diesel Automatic Second NaN 5 NaN 16.11 1493 0.00
4234 Mercedes-Benz M-Class ML 350 4Matic Chennai 2012 63000 Diesel Automatic First 165.00 5 NaN 26.00 2987 0.00
4302 Hyundai Santro Xing GL Delhi 2012 61449 Petrol Manual First 62.00 5 NaN 2.15 1086 0.00
4412 Mercedes-Benz M-Class ML 350 4Matic Coimbatore 2016 27833 Diesel Automatic First 165.00 5 NaN 49.24 2987 0.00
4629 Fiat Siena 1.2 ELX Jaipur 2001 70000 Petrol Manual Third NaN 5 NaN 0.55 1242 0.00
4687 Land Rover Freelander 2 TD4 SE Jaipur 2012 119203 Diesel Automatic First 115.00 5 NaN 16.50 2179 0.00
4704 Mercedes-Benz M-Class ML 350 4Matic Bangalore 2015 20000 Diesel Automatic First 165.00 5 NaN 47.35 2987 0.00
5016 Land Rover Freelander 2 TD4 HSE Delhi 2013 72000 Diesel Automatic First 115.00 5 NaN 15.50 2179 0.00
5022 Land Rover Freelander 2 TD4 SE Hyderabad 2013 46000 Diesel Automatic Second 115.00 5 NaN 26.00 2179 0.00
5119 Hyundai Santro Xing XP Kolkata 2008 45500 Petrol Manual Second NaN 5 NaN 1.17 1086 0.00
5270 Honda City 1.5 GXI Bangalore 2002 53000 Petrol Manual Second NaN 5 NaN 1.85 1493 0.00
5311 Land Rover Freelander 2 TD4 SE Hyderabad 2012 139000 Diesel Automatic First 115.00 5 NaN 16.75 2179 0.00
5374 Mercedes-Benz M-Class ML 350 4Matic Ahmedabad 2012 66000 Diesel Automatic First 165.00 5 NaN 23.00 2987 0.00
5426 Hyundai Santro Xing XL Chennai 2006 85000 Petrol Manual Third NaN 5 NaN 1.30 1086 0.00
5529 Hyundai Santro LP - Euro II Chennai 2005 105000 Petrol Manual First NaN 5 NaN 1.75 999 0.00
5647 Toyota Qualis Fleet A3 Mumbai 2001 227000 Diesel Manual Fourth & Above NaN 8 NaN 2.20 2446 0.00
5875 Mercedes-Benz C-Class Progressive C 220d Ahmedabad 2019 4000 Diesel Automatic First 194.00 5 49.14 Lakh 35.00 1950 0.00
5943 Mahindra Jeep MM 540 DP Chennai 2002 75000 Diesel Manual First NaN 6 NaN 1.70 2112 0.00
5972 Hyundai Santro Xing GL Mumbai 2008 65000 Petrol Manual Second 62.00 5 NaN 1.39 1086 0.00
6011 Skoda Superb 3.6 V6 FSI Hyderabad 2009 53000 Petrol Automatic First 262.60 5 NaN 4.75 3597 0.00
6090 Hyundai Santro Xing GL Ahmedabad 2013 63831 Petrol Manual First 62.00 5 NaN NaN 1086 0.00
6093 Hyundai Santro Xing XL Bangalore 2007 47000 Petrol Manual Second NaN 5 NaN NaN 1086 0.00
6177 Mercedes-Benz M-Class ML 350 4Matic Bangalore 2012 37000 Diesel Automatic First 165.00 5 NaN NaN 2987 0.00
6205 Hyundai Santro Xing GL Ahmedabad 2007 78000 Petrol Manual First 62.00 5 NaN NaN 1086 0.00
6439 Hyundai Santro GLS I - Euro II Bangalore 2011 43189 Petrol Manual First NaN 5 NaN NaN 999 0.00
6454 Hyundai Santro LS zipDrive Euro I Chennai 2002 120000 Petrol Manual Fourth & Above NaN 5 NaN NaN 999 0.00
6491 Mercedes-Benz M-Class ML 350 4Matic Coimbatore 2016 22177 Diesel Automatic First 165.00 5 NaN NaN 2987 0.00
6576 Hyundai Santro LS zipPlus Kolkata 2002 80000 Petrol Manual First NaN 5 NaN NaN 1086 0.00
6633 Mahindra TUV 300 P4 Kolkata 2016 27000 Diesel Manual First NaN 5 NaN NaN 1493 0.00
6697 Hyundai Santro Xing XL Jaipur 2007 85000 Petrol Manual Second NaN 5 NaN NaN 1086 0.00
6857 Land Rover Freelander 2 TD4 SE Mumbai 2011 87000 Diesel Automatic First 115.00 5 NaN NaN 2179 0.00
6957 Honda Jazz 2020 Petrol Kochi 2019 11574 Petrol Manual First 88.70 5 NaN NaN 1199 0.00
7226 Hyundai Santro Xing GL Ahmedabad 2014 41000 Petrol Manual First 62.00 5 NaN NaN 1086 0.00

There should be no 0's mpkm. This could be a data-entry error or system mistake. Let's investigate further.

Let's treat them as missing and handle all missing values later.

In [44]:
# replace with np.nan
df["Mileage_num"] = df["Mileage_num"].replace(df.Mileage_num.min(), np.nan)
In [45]:
# subset with loc
df.loc[df.Mileage_num == df.Mileage_num.min()]
Out[45]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats New_Price Price Engine_num Mileage_num
S.No.
5781 Lamborghini Gallardo Coupe Delhi 2011 6500 Petrol Automatic Third 560.00 2 NaN 120.00 5204 6.40

Missing Values¶

There are lots of ways to handle missing values. I'm going to start by investigating the patterns in the missingness.

In [46]:
# looking at which columns have the most missing values
df.isna().sum().sort_values(ascending=False)
Out[46]:
New_Price            6246
Price                1234
Power                 175
Mileage_num            81
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Seats                   0
Engine_num              0
dtype: int64

Calculate the percentage of the data points missing per variable and if any variable has more than 50% missing values in it those are generally not useful and dropping those variables instead is recommended.

In [47]:
# Find percentage missing per column
df.isna().mean().sort_values(ascending=False).round(4) * 100
Out[47]:
New_Price           86.14
Price               17.02
Power                2.41
Mileage_num          1.12
Name                 0.00
Location             0.00
Year                 0.00
Kilometers_Driven    0.00
Fuel_Type            0.00
Transmission         0.00
Owner_Type           0.00
Seats                0.00
Engine_num           0.00
dtype: float64
In [48]:
# counting the number of missing values per row
df.isnull().sum(axis=1).value_counts()
Out[48]:
1    5212
2    1142
0     819
3      72
4       6
dtype: int64

There are 32 rows with 3 missing values, and 1163 with atleast 2. This is quite messy still and will need correcting.

In [49]:
# print a df of the null values
null_data = df[df.isnull().any(axis=1)]
display(null_data)
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats New_Price Price Engine_num Mileage_num
S.No.
0 Maruti Wagon R LXI CNG Mumbai 2010 72000 CNG Manual First 58.16 5 NaN 1.75 998 26.60
1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 41000 Diesel Manual First 126.20 5 NaN 12.50 1582 19.67
3 Maruti Ertiga VDI Chennai 2012 87000 Diesel Manual First 88.76 7 NaN 6.00 1248 20.77
4 Audi A4 New 2.0 TDI Multitronic Coimbatore 2013 40670 Diesel Automatic Second 140.80 5 NaN 17.74 1968 15.20
5 Hyundai EON LPG Era Plus Option Hyderabad 2012 75000 LPG Manual First 55.20 5 NaN 2.35 814 21.10
... ... ... ... ... ... ... ... ... ... ... ... ... ...
7248 Volkswagen Vento Diesel Trendline Hyderabad 2011 89411 Diesel Manual First 103.60 5 NaN NaN 1598 20.54
7249 Volkswagen Polo GT TSI Mumbai 2015 59000 Petrol Automatic First 103.60 5 NaN NaN 1197 17.21
7250 Nissan Micra Diesel XV Kolkata 2012 28000 Diesel Manual First 63.10 5 NaN NaN 1461 23.08
7251 Volkswagen Polo GT TSI Pune 2013 52262 Petrol Automatic Third 103.60 5 NaN NaN 1197 17.20
7252 Mercedes-Benz E-Class 2009-2013 E 220 CDI Avan... Kochi 2014 72443 Diesel Automatic First 170.00 5 NaN NaN 2148 10.00

6432 rows × 13 columns

Re-check missing after dropping.

In [50]:
# create a list of columns with empty, or na values
nulls = df.columns[df.isna().any().tolist()]
In [51]:
# looking at which columns have the most missing values
df.isna().sum().sort_values(ascending=False)
Out[51]:
New_Price            6246
Price                1234
Power                 175
Mileage_num            81
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Seats                   0
Engine_num              0
dtype: int64

Observations

In [52]:
# Find percentage missing per column
df.isna().mean().sort_values(ascending=False).round(4) * 100
Out[52]:
New_Price           86.14
Price               17.02
Power                2.41
Mileage_num          1.12
Name                 0.00
Location             0.00
Year                 0.00
Kilometers_Driven    0.00
Fuel_Type            0.00
Transmission         0.00
Owner_Type           0.00
Seats                0.00
Engine_num           0.00
dtype: float64

Observations:

  1. The New_Price column has a lot of missing values. These could be cars that do not have an equivalent model on the new-car market. Since I'm not particularly interested in this column as the description of the project states that these two markets are very different and I can't confidently say that there is a strong relationship between new car price and used car price (even though in common knowledge there is a jolt in used car sales when new car prices are high), I'll drop it for the sake of our analysis.

  2. Power will need to be fixed as well.

Imputing Missing Values¶

New_Price

In [53]:
# Drop any unneeded columns
df.drop(["New_Price"], axis=1, inplace=True)
In [54]:
df.isna().mean().sort_values(ascending=False).round(4) * 100
Out[54]:
Price               17.02
Power                2.41
Mileage_num          1.12
Name                 0.00
Location             0.00
Year                 0.00
Kilometers_Driven    0.00
Fuel_Type            0.00
Transmission         0.00
Owner_Type           0.00
Seats                0.00
Engine_num           0.00
dtype: float64
Standard Replacement¶

We will replace missing values in every needed column with its median increasing it's distortion <5% shouldnt matter (or drop all together).

Including Price here as opposed to dropping it or filling in with mean to not skew to outliers, even though it has 16% blanks, I want to keep as many data points as possible for the modeling.

In [55]:
# count the number missing
df.Price.isna().sum()
Out[55]:
1234
In [56]:
numeric_columns = ["Power", "Price", "Mileage_num"]

for i in numeric_columns:
    print(i, df[i].mean())

medianFiller = lambda x: x.fillna(x.median())

df[numeric_columns] = df[numeric_columns].apply(medianFiller, axis=0)

for i in numeric_columns:
    print("After filling in medians", i, df[i].mean())
Power 112.78097583380348
Price 9.478339704171498
Mileage_num 18.346527196652747
After filling in medians Power 112.33591021927913
After filling in medians Price 8.825117914770264
After filling in medians Mileage_num 18.344890359950362

Very little difference after missing value filling. Price moved down .6 Lakhs.

Duplicates¶

In [57]:
# check for duplicated items
df.duplicated(keep=False).value_counts()
Out[57]:
False    7249
True        2
dtype: int64
In [58]:
# dropping duplicate rows
df.drop_duplicates(keep=False, inplace=True)
In [59]:
df[df.duplicated(keep=False)]  # check drop
Out[59]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num
S.No.

No duplicates found.

In [60]:
# check the number of unique values in each column of the dataframe
df.nunique().sort_values(ascending=False)
Out[60]:
Kilometers_Driven    3660
Name                 2039
Price                1373
Mileage_num           437
Power                 383
Engine_num            149
Year                   23
Location               11
Seats                   9
Fuel_Type               4
Owner_Type              4
Transmission            2
dtype: int64
In [61]:
# View the unique items in each row
for i in df.columns:
    display(i, pd.DataFrame(df[i].unique()))
'Name'
0
0 Maruti Wagon R LXI CNG
1 Hyundai Creta 1.6 CRDi SX Option
2 Honda Jazz V
3 Maruti Ertiga VDI
4 Audi A4 New 2.0 TDI Multitronic
... ...
2034 Skoda Superb Petrol Ambition
2035 Tata Tiago 1.05 Revotorq XT Option
2036 Ford EcoSport 1.5 Petrol Ambiente
2037 Jeep Compass 1.4 Sport
2038 Hyundai Elite i20 Magna Plus

2039 rows × 1 columns

'Location'
0
0 Mumbai
1 Pune
2 Chennai
3 Coimbatore
4 Hyderabad
5 Jaipur
6 Kochi
7 Kolkata
8 Delhi
9 Bangalore
10 Ahmedabad
'Year'
0
0 2010
1 2015
2 2011
3 2012
4 2013
5 2016
6 2018
7 2014
8 2017
9 2007
10 2009
11 2008
12 2019
13 2006
14 2005
15 2004
16 2002
17 2000
18 2003
19 1999
20 2001
21 1998
22 1996
'Kilometers_Driven'
0
0 72000
1 41000
2 46000
3 87000
4 40670
... ...
3655 18242
3656 21190
3657 89411
3658 52262
3659 72443

3660 rows × 1 columns

'Fuel_Type'
0
0 CNG
1 Diesel
2 Petrol
3 LPG
'Transmission'
0
0 Manual
1 Automatic
'Owner_Type'
0
0 First
1 Second
2 Fourth & Above
3 Third
'Power'
0
0 58.16
1 126.20
2 88.70
3 88.76
4 140.80
... ...
378 147.40
379 328.50
380 98.63
381 360.00
382 241.60

383 rows × 1 columns

'Seats'
0
0 5
1 7
2 8
3 4
4 6
5 2
6 10
7 9
8 0
'Price'
0
0 1.75
1 12.50
2 4.50
3 6.00
4 17.74
... ...
1368 1.39
1369 30.54
1370 2.27
1371 17.56
1372 7.43

1373 rows × 1 columns

'Engine_num'
0
0 998
1 1582
2 1199
3 1248
4 1968
5 814
6 1461
7 2755
8 1598
9 1462
10 1497
11 2179
12 2477
13 1498
14 2143
15 1995
16 1984
17 1197
18 2494
19 1798
20 2696
21 2698
22 1061
23 1198
24 2987
25 796
26 624
27 1999
28 1991
29 2694
30 1120
31 2498
32 799
33 2393
34 1399
35 1796
36 2148
37 1396
38 1950
39 4806
40 1998
41 1086
42 1193
43 2982
44 1493
45 2967
46 2993
47 1196
48 1799
49 2497
50 2354
51 1373
52 2996
53 1591
54 2894
55 5461
56 1595
57 936
58 1997
59 1896
60 1390
61 1364
62 2199
63 993
64 999
65 1405
66 2956
67 1794
68 995
69 2496
70 1599
71 2400
72 1495
73 2523
74 793
75 4134
76 1596
77 1395
78 2953
79 1586
80 2362
81 1496
82 1368
83 1298
84 1956
85 1299
86 3498
87 2835
88 1150
89 3198
90 1343
91 1499
92 1186
93 1590
94 2609
95 2499
96 2446
97 1978
98 2360
99 3436
100 2198
101 4367
102 2706
103 1422
104 2979
105 1969
106 1489
107 2489
108 1242
109 1388
110 1172
111 2495
112 1194
113 3200
114 1781
115 1341
116 2773
117 3597
118 1985
119 2147
120 1047
121 2999
122 2995
123 2997
124 1948
125 2359
126 4395
127 2349
128 2720
129 1468
130 3197
131 2487
132 1597
133 2771
134 4951
135 970
136 2925
137 2200
138 5000
139 2149
140 5998
141 2092
142 5204
143 2112
144 1797
145 2000
146 1795
147 3696
148 1389
'Mileage_num'
0
0 26.60
1 19.67
2 18.20
3 20.77
4 15.20
... ...
432 11.88
433 12.08
434 15.56
435 14.50
436 17.15

437 rows × 1 columns

Observations:

  1. Name has many unique values as expected.
  2. We have some simple options in Transmission, Owner_Type, and Fuel_type that will make visual analysis simpler.

Distribution and Skewness¶

Log Transformations¶

Since we are using a Linear Regression model which assumes normality, I will transform variables with high skewness. H

If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.

If the skewness is between -1 and – 0.5 or between 0.5 and 1, the data are moderately skewed.

If the skewness is less than -1 or greater than 1, the data are highly skewed.

In [62]:
# Subset the dataframe based on numeric and categorical types
numerical_features = df.select_dtypes(include=["int64", "float64"])
global numerical_features
[numerical_features]
Out[62]:
[       Year  Kilometers_Driven  Power  Seats  Price  Engine_num  Mileage_num
 S.No.                                                                       
 0      2010              72000  58.16      5   1.75         998        26.60
 1      2015              41000 126.20      5  12.50        1582        19.67
 2      2011              46000  88.70      5   4.50        1199        18.20
 3      2012              87000  88.76      7   6.00        1248        20.77
 4      2013              40670 140.80      5  17.74        1968        15.20
 ...     ...                ...    ...    ...    ...         ...          ...
 7248   2011              89411 103.60      5   5.64        1598        20.54
 7249   2015              59000 103.60      5   5.64        1197        17.21
 7250   2012              28000  63.10      5   5.64        1461        23.08
 7251   2013              52262 103.60      5   5.64        1197        17.20
 7252   2014              72443 170.00      5   5.64        2148        10.00
 
 [7249 rows x 7 columns]]
In [63]:
# get skew
numerical_features.skew()
Out[63]:
Year                -0.84
Kilometers_Driven   61.57
Power                2.00
Seats                1.91
Price                3.73
Engine_num           1.42
Mileage_num          0.21
dtype: float64
In [64]:
# get normality check: std of 1 and mean of 0
display("mean", numerical_features.mean())
display("std", numerical_features.std())
'mean'
Year                 2013.37
Kilometers_Driven   58704.14
Power                 112.33
Seats                   5.28
Price                   8.83
Engine_num           1616.01
Mileage_num            18.35
dtype: float64
'std'
Year                    3.25
Kilometers_Driven   84450.70
Power                  52.93
Seats                   0.81
Price                  10.30
Engine_num            593.35
Mileage_num             4.14
dtype: float64
In [65]:
cols_to_log = [
    "Year",
    "Kilometers_Driven",
    "Mileage_num",
    "Engine_num",
    "Power",
    "Seats",
]

# print histogram for each column
for colname in cols_to_log:
    plt.hist(df[colname], bins=50)
    plt.title(colname)
    plt.show()
    print(np.sum(df[colname] <= 0))  # check for values <0
0
0
0
0
0
1

Observations

  • Year, Engine_num, Powerand Kilometer_Driven have strong skewness.
  • Mileage_num andSeats look fairly nicely normally distributed.

Time to consider other transformations. Another option is to use np.arcsinh which is like the log for large values but handles negative and zero values as well.

Testing Transformations

In [66]:
high_skew = ["Year", "Kilometers_Driven", "Power", "Engine_num"]
med_skew = ["Mileage_num", "Seats"]

for i in high_skew:

    # on higher skew
    plt.hist(np.arcsinh(df[i]), 50)
    display(i)
    plt.title("arcsinh")
    plt.show()

for i in med_skew:

    # try a regular log transform
    plt.hist(np.log(df[i] + 1), 50)
    display(i)
    plt.title("log")
    plt.show()


# on weaker skewness
plt.hist(np.sqrt(df["Seats"]), 50)
plt.title("sqrt")
plt.show()
'Year'
'Kilometers_Driven'
'Power'
'Engine_num'
'Mileage_num'
'Seats'

Observations

Year mean has moved closer to 0 and our standard deviation closer to 1.

Price¶

In [67]:
# look at the distribution
sns.distplot(df.Price).set_title("Distribution of Price")
Out[67]:
Text(0.5, 1.0, 'Distribution of Price')
In [68]:
# create probability plot
stats.probplot(df.Price, dist="norm", plot=pylab)
plt.show()

Observations

  • Price does not follow a normal distribution.
  • The probability curve shows that it is not on a linear trend.

Decision: What would it look like if we transformed Price on a log-scale?

In [69]:
x_bar, sigma = norm.fit(df.Price)
print("\n mean = {:2f} and std-dev = {:2f}\n".format(x_bar, sigma))
 mean = 8.825997 and std-dev = 10.295087

In [70]:
# plot the distribution with log-distribution
sns.histplot(np.log(df.Price), kde=True)
plt.title("log")

plt.legend(
    ["Normal Dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )".format(x_bar, sigma)],
    loc="best",
)
plt.ylabel("Frequency")
plt.title("Distribution of Prices")
plt.show()

# QQ plot
fig = plt.figure()
res = stats.probplot(df.Price, plot=plt)
plt.show()

Observations: The log transformation decreases the scale of the distributions, even with the huge range of Price. It seems the outliers caused the log-transformed distributions to still be a bit skewed, but it is closer to normal than the original distribution.

In [71]:
# Log Transformation has definitely helped in reducing the skew
# Creating a new column with the transformed variable.
df["price_log"] = np.log(df["Price"])

Outliers & Treatment¶

Notes from GreatLearning:

"An outlier is a data point that is distant from other similar points.

Outliers in the data can distort predictions and affect the accuracy so it's important to flag them for review. This is especially the case with regression models.

The challenge with outlier detection is determining if a point is truly a problem or simply a large value. If a point is genuine then it is very important to keep it in the data as otherwise we're removing the most interesting pdata points. Regardless, it is essential to understand their impact on our predictive models and statistics, and the extent to which a small number of outlying points are dominating the fit of the model (for example, the mean is much more sensitive to outliers than the median). It is left to the best judgement of the investigator to decide whether treating outliers is necessary and how to go about it. Knowledge of domain and impact of the business problem tend to drive this decision. "

Detection using Z-Score¶

Notes from GreatLearning:

"The z-transformation used earlier can flag a point as being far away from the mean. If the data are normally distributed then we expect the vast majority of points to be within 3 standard deviations of the mean, which corresponds to a z score with an absolute value of at most 3.

If the data are not normal, however, the situation is more complicated."

In [72]:
# get a sub-df of independent variables to transform
trans = df[["Year", "Kilometers_Driven", "Mileage_num", "Engine_num", "Power", "Seats"]]
In [73]:
def z_transform(x):
    return (x - np.mean(x)) / np.std(x)

"""
for i in list(numerical_features):
    plt.hist(z_transform(df[i].values))
    plt.title('z-transformed normal data')
    plt.show()
"""
    
for (column_name, column_data) in trans.iteritems():
    plt.hist(z_transform(column_data))
    plt.title('z-transformed normal data')
    print(column_name)
    plt.show()
Year
Kilometers_Driven
Mileage_num
Engine_num
Power
Seats
In [74]:
numerical_features = df.select_dtypes(include=["int64", "float64"])
# numerical_features

All of the points in each plot are drawn from the exact same distribution, so it's not fair to call any of the points outliers in the sense of there being bad data.

It looks like every column tested has outliers using this method.

But depending on the distribution in question, we may have almost all of the z-scores between -3 and 3 or instead there could be extremely large values.

In [75]:
# set parameters for -3 to +3 cut-off
upper = numerical_features.mean() + 3 * numerical_features.std()
lower = numerical_features.mean() - 3 * numerical_features.std()

# create a new dataframe with those outliers
out_df = df[(numerical_features < upper) & (numerical_features > lower)]
display(out_df)
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log
S.No.
0 NaN NaN 2010.00 72000.00 NaN NaN NaN 58.16 5.00 1.75 998.00 26.60 0.56
1 NaN NaN 2015.00 41000.00 NaN NaN NaN 126.20 5.00 12.50 1582.00 19.67 2.53
2 NaN NaN 2011.00 46000.00 NaN NaN NaN 88.70 5.00 4.50 1199.00 18.20 1.50
3 NaN NaN 2012.00 87000.00 NaN NaN NaN 88.76 7.00 6.00 1248.00 20.77 1.79
4 NaN NaN 2013.00 40670.00 NaN NaN NaN 140.80 5.00 17.74 1968.00 15.20 2.88
... ... ... ... ... ... ... ... ... ... ... ... ... ...
7248 NaN NaN 2011.00 89411.00 NaN NaN NaN 103.60 5.00 5.64 1598.00 20.54 1.73
7249 NaN NaN 2015.00 59000.00 NaN NaN NaN 103.60 5.00 5.64 1197.00 17.21 1.73
7250 NaN NaN 2012.00 28000.00 NaN NaN NaN 63.10 5.00 5.64 1461.00 23.08 1.73
7251 NaN NaN 2013.00 52262.00 NaN NaN NaN 103.60 5.00 5.64 1197.00 17.20 1.73
7252 NaN NaN 2014.00 72443.00 NaN NaN NaN 170.00 5.00 5.64 2148.00 10.00 1.73

7249 rows × 13 columns

Now let's create a new column to do a z-score test with.

In [76]:
for i in numerical_features:
    df["zscore"] = (df[i] - df[i].mean()) / df[i].std()
df.head(4)
Out[76]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log zscore
S.No.
0 Maruti Wagon R LXI CNG Mumbai 2010 72000 CNG Manual First 58.16 5 1.75 998 26.60 0.56 -1.57
1 Hyundai Creta 1.6 CRDi SX Option Pune 2015 41000 Diesel Manual First 126.20 5 12.50 1582 19.67 2.53 0.90
2 Honda Jazz V Chennai 2011 46000 Petrol Manual First 88.70 5 4.50 1199 18.20 1.50 -0.38
3 Maruti Ertiga VDI Chennai 2012 87000 Diesel Manual First 88.76 7 6.00 1248 20.77 1.79 -0.02

Now let's look for rows that would need to be cut using this method.

In [77]:
display(df[df["zscore"] > 3].shape)
display(df[df["zscore"] < -3].shape)
(31, 14)
(12, 14)

We would lose a total of 22 rows using this method. I'd rather go column by column so I will drop this column.

In [78]:
df.drop("zscore", inplace=True, axis=1)

Detection using IQR¶

In [79]:
# Function from GreatLearning
"""
Another way to flag points as outliers is to compute the IQR, 
which is the interval going from the 1st quartile to the 3rd quartile of the data in question, 
and then flag a point for investigation if it is outside 1.5 * IQR.
"""


def frac_outside_1pt5_IQR(x):
    length = 1.5 * np.diff(np.quantile(x, [0.25, 0.75]))
    return np.mean(np.abs(x - np.median(x)) > length)


for (column_name, column_data) in numerical_features.iteritems():

    print(column_name, frac_outside_1pt5_IQR(column_data.to_numpy()) * 100)
Year 3.4487515519381984
Kilometers_Driven 6.387087874189544
Power 9.049524072285832
Seats 15.90564215753897
Price 16.73334253000414
Engine_num 7.366533314939992
Mileage_num 2.552076148434267
price_log 15.422816940267623

Seats and Price data has over 15% of the sample is flagged as outliers using this method.

Detection using 4* Interquartile range method: Hamel Method¶

In [80]:
for (column_name, column_data) in numerical_features.iteritems():
    quartiles = np.quantile(df[column_name][df[column_name].notnull()], [0.25, 0.75])
    column_name_4iqr = 4 * (quartiles[1] - quartiles[0])
    print(
        "\n",
        column_name,
        f"Q1 = {quartiles[0]}, Q3 = {quartiles[1]}, 4*IQR = {column_name_4iqr}",
    )

    # create a dataframe of each place where the difference between value and median for that variable are greater than 4*IQR for that variable
    outlier_column_name = df.loc[
        np.abs(df[column_name] - df[column_name].median()) > column_name_4iqr,
        column_name,
    ]

    # check len and values for each
    print("\n", outlier_column_name)
 Year Q1 = 2011.0, Q3 = 2016.0, 4*IQR = 20.0

 Series([], Name: Year, dtype: int64)

 Kilometers_Driven Q1 = 34000.0, Q3 = 73000.0, 4*IQR = 156000.0

 S.No.
29       262000
77       230000
181      216000
340      775000
358      620000
838      215000
1068     225000
1101     248000
1442     231673
1528     299322
1766     220000
1846     255000
1860     720000
1975     282000
2267     215750
2328    6500000
2346     234000
2823     480000
2988     240000
3092     480000
3649     300000
3713     232000
3894     250000
4491     445000
4719     250000
4792     242000
5647     227000
5800     210000
5834     250000
6919     290000
6921     350000
Name: Kilometers_Driven, dtype: int64

 Power Q1 = 77.0, Q3 = 138.03, 4*IQR = 244.12

 S.No.
70     500.00
134    362.07
148    444.00
152    362.90
418    367.00
589    364.90
1930   364.90
2095   362.07
2100   550.00
2978   394.30
3132   340.00
3341   402.00
4061   444.00
4342   364.90
4451   395.00
4627   450.00
4691   421.00
4722   387.30
4821   450.00
5088   503.00
5341   382.00
5521   552.00
5603   394.30
5781   560.00
5919   488.10
6186   364.90
6354   616.00
6960   362.07
7057   450.00
7117   360.00
Name: Power, dtype: float64

 Seats Q1 = 5.0, Q3 = 5.0, 4*IQR = 0.0

 S.No.
3       7
7       8
15      7
29      7
32      7
       ..
7175    7
7194    4
7198    7
7206    7
7223    7
Name: Seats, Length: 1153, dtype: int64

 Price Q1 = 3.85, Q3 = 8.4, 4*IQR = 18.200000000000003

 S.No.
13     27.00
19     28.00
38     28.00
62     26.70
67     35.67
        ... 
5927   45.52
5946   48.00
5970   26.76
5996   30.54
6008   45.00
Name: Price, Length: 553, dtype: float64

 Engine_num Q1 = 1198.0, Q3 = 1968.0, 4*IQR = 3080.0

 S.No.
70      4806
152     5461
2100    4806
2978    4806
4451    4951
4691    5461
4722    5461
5088    5000
5521    5998
5603    4806
5781    5204
5919    5000
6354    5998
Name: Engine_num, dtype: int64

 Mileage_num Q1 = 15.4, Q3 = 21.1, 4*IQR = 22.800000000000004

 Series([], Name: Mileage_num, dtype: float64)

 price_log Q1 = 1.3480731482996928, Q3 = 2.128231705849268, 4*IQR = 3.1206342301983003

 S.No.
4079   5.08
Name: price_log, dtype: float64

Observations

  1. Year,Engine and Power is not generating a series to view the values and so has none outside in the 4*IQR
  2. Kilometers_Driven has relatively few outliers and could simply drop them or replace with nan.
  3. Price and Seats have many outliers in this range and need to be dealt with without dropping or replacing with nans on all values.

Visualizing Outliers¶

In [81]:
for (column_name, column_data) in numerical_features.iteritems():
    plt.hist(column_data, 20)
    plt.title("Histogram")
    display(column_name)
    plt.show()

    sns.boxplot(column_data)
    plt.title("Boxplot")
    display(column_name)
    plt.show()
'Year'
'Year'
'Kilometers_Driven'
'Kilometers_Driven'
'Power'
'Power'
'Seats'
'Seats'
'Price'
'Price'
'Engine_num'
'Engine_num'
'Mileage_num'
'Mileage_num'
'price_log'
'price_log'

Kilometers_Driven, Power, Price "have a lot of values that are flagged as suspicious by the boxplot, but in the histogram we can see that the distribution is skewed so these points aren't inconsistent with the overall distribution of the data. Nevertheless, having a heavy tail means we might want to consider statistics less sensitive to large values, so e.g. the median may be a better measure of central tendancy."

In [82]:
# let's plot the boxplots of all columns to check for outliers
plt.figure(figsize=(20, 30))

for i, variable in enumerate(numerical_features):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Let's check outliers against the target variable and see where they are. We see price, Kilometers_Driven and Power have heavy outliers. And some exist in every column.

In [83]:
# plot indepdents against dependent
for i, variable in enumerate(numerical_features):
    sns.scatterplot(x=df[variable], y=df.Price)
    plt.show()

Observations

Year, Seats, Mileage has outliers in the upper and lower left regions. Kilometers_Driven is different with outliers on either end.

  • There are many upper-end outliers in Power, Price, and Engine.

  • "We will treat these outliers as these might adversely affect the predictive power of linear model. However, in real life, these outliers may be due to non-linear pattern in the data or can be important information. Sometimes outliers in the independent variable can adversely impact the linear model. This can be checked by building the model with and without outliers and comparing the model performances."

Decision:

For the sake of trying to make the data fit a normal distribution more easily, I will go ahead and drop outliers from these keys columns.

Treatment¶

Notes from GreatLearning:

"If we decide that we do actually have some problematic outliers, we have a few options.

  • If the point seems truly nonsensical it may be best to treat it as missing

  • Alternatively, we could drop that observation or we could use statistics that are robust to outliers

  • Could replace with the max value from the column or a string value

It's often a good idea to examine the sensitivity to outliers by running an analysis with and without them. If they are because of the types of data then don't drop. But statistically, it is better to cut them off. "

Let's examine some of the maximum values that stand out as outliers. They could be data entry errors.

Extreme values in Kilometers_Driven.

In [84]:
# box plot
sns.boxplot(df.Kilometers_Driven)
Out[84]:
<AxesSubplot:xlabel='Kilometers_Driven'>
In [85]:
# view the highest and lowest values
df[~(df["Kilometers_Driven"] < 750000)]  # | (df["Kilometers_Driven"] > 41000)]
Out[85]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log
S.No.
340 Skoda Octavia Ambition Plus 2.0 TDI AT Kolkata 2013 775000 Diesel Automatic First 141.00 5 7.50 1968 19.30 2.01
2328 BMW X5 xDrive 30d M Sport Chennai 2017 6500000 Diesel Automatic First 258.00 5 65.00 2993 15.97 4.17

650k KM doesn't make sense in 4 year span. so we'll drop it.

In [86]:
# drop the high values by row
df.drop(labels=[2328], axis=0, inplace=True)

Check the drop.

In [87]:
# view the high values
df[~(df["Kilometers_Driven"] < 750000)]  # | (df["Kilometers_Driven"] > +15)]
Out[87]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log
S.No.
340 Skoda Octavia Ambition Plus 2.0 TDI AT Kolkata 2013 775000 Diesel Automatic First 141.00 5 7.50 1968 19.30 2.01

Fixing extreme values in Seats

In [88]:
df.sort_values(by=["Seats"], ascending=True)
Out[88]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log
S.No.
3999 Audi A4 3.2 FSI Tiptronic Quattro Hyderabad 2012 125000 Petrol Automatic First 94.34 0 18.00 3197 10.50 2.89
5781 Lamborghini Gallardo Coupe Delhi 2011 6500 Petrol Automatic Third 560.00 2 120.00 5204 6.40 4.79
4722 Mercedes-Benz SL-Class SL 500 Kolkata 2010 35000 Petrol Automatic First 387.30 2 29.50 5461 8.10 3.38
6842 Nissan 370Z AT Kolkata 2012 14850 Petrol Automatic First 328.50 2 5.64 3696 10.00 1.73
926 Porsche Cayman 2009-2012 S Hyderabad 2010 10000 Petrol Manual First 94.34 2 40.00 3436 9.00 3.69
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1907 Toyota Qualis FS B3 Bangalore 2002 63000 Diesel Manual Third 75.00 10 3.65 2446 13.10 1.29
2267 Toyota Qualis RS E2 Pune 2004 215750 Diesel Manual Second 94.34 10 3.50 2446 18.20 1.25
2575 Chevrolet Tavera LS B3 10 Seats BSIII Hyderabad 2015 120000 Diesel Manual First 80.00 10 5.50 2499 14.80 1.70
6242 Tata Sumo EX 10/7 Str BSII Chennai 2015 196000 Diesel Manual Second 68.00 10 5.64 1948 12.20 1.73
917 Tata Sumo DX Mumbai 2013 20000 Diesel Manual First 83.10 10 5.25 1978 14.07 1.66

7248 rows × 13 columns

Here we will drop the 0 value in Seats which is likely a data-entry error.

In [89]:
# drop the high value
df.drop(labels=3999, axis=0, inplace=True)
In [90]:
# sort df by column
df.sort_values(by=["Seats"], ascending=True)
Out[90]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log
S.No.
5781 Lamborghini Gallardo Coupe Delhi 2011 6500 Petrol Automatic Third 560.00 2 120.00 5204 6.40 4.79
6960 Mercedes-Benz SLC 43 AMG Coimbatore 2018 18338 Petrol Automatic First 362.07 2 5.64 2996 19.00 1.73
1078 Porsche Boxster S tiptronic Kolkata 2015 10512 Petrol Automatic First 265.00 2 64.00 2706 8.60 4.16
2305 Porsche Cayman 2009-2012 S tiptronic Mumbai 2011 8000 Petrol Automatic First 94.34 2 43.00 3436 9.00 3.76
5919 Jaguar F Type 5.0 V8 S Hyderabad 2015 8000 Petrol Automatic First 488.10 2 100.00 5000 12.50 4.61
... ... ... ... ... ... ... ... ... ... ... ... ... ...
6875 Toyota Qualis FS B3 Pune 2002 119613 Diesel Manual Second 75.00 10 5.64 2446 13.10 1.73
814 Toyota Qualis FS B2 Pune 2004 77757 Diesel Manual Second 75.00 10 3.50 2446 13.10 1.25
1907 Toyota Qualis FS B3 Bangalore 2002 63000 Diesel Manual Third 75.00 10 3.65 2446 13.10 1.29
6288 Chevrolet Tavera LS B3 10 Seats BSIII Hyderabad 2005 150000 Diesel Manual Second 80.00 10 5.64 2499 14.80 1.73
2575 Chevrolet Tavera LS B3 10 Seats BSIII Hyderabad 2015 120000 Diesel Manual First 80.00 10 5.50 2499 14.80 1.70

7247 rows × 13 columns

In [91]:
# id row where Seats = 0
df[df.Seats == 0.00]
Out[91]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log
S.No.
In [92]:
# let's look at box plot to see if outliers have been treated or not
plt.figure(figsize=(20, 30))

for i, variable in enumerate(numerical_features):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Price

Decision:

Without having more information about the target market for this business its hard to decide whether to remove the highest price Used Cars over 150Lahks. I imagine in many cases they are purchased new and not used. I am only losing ~1% of the dataset which seems reasonable. For the sake the modeling we are doing here and the need to normalize it, I will drop them.

In [93]:
# check the drop
df[df.Price >= 150.00]
Out[93]:
Name Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log
S.No.
4079 Land Rover Range Rover 3.0 Diesel LWB Vogue Hyderabad 2017 25000 Diesel Automatic First 255.00 5 160.00 2993 13.33 5.08
In [94]:
# drop rows
df.drop(df[df.Price >= 150.00].index, inplace=True)
In [95]:
# let's look at box plot to see if outliers have been treated or not
plt.figure(figsize=(20, 30))

for i, variable in enumerate(numerical_features):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
In [96]:
# plot indepdents against dependent
for i, variable in enumerate(numerical_features):
    sns.scatterplot(x=df[variable], y=df.Price)
    plt.show()
In [97]:
# get the size of the current df vs the original
display(len(df))
display("retained", (len(df) / 7253) * 100, "% of original data")
7246
'retained'
99.90348821177444
'% of original data'

After treatment we only lost 5 records from our original 7253 records.

Observations

  • After treating outliers we have a largely cleaned dataframe for extreme values.
  • After removing them there is still 98% of our original records in place.
  • This should be a good set-up for distribution and skewness adjustments

Feature Engineering¶

Names¶

As a key feature against our target varialbe (Price), Name needs to get reassigned into two separate variables to account for the inherent data-structure (Brand + Model)

In [98]:
df["Name"].nunique()
Out[98]:
2037

I will try to pull out the Make vs the Model to make it simpler.

In [99]:
# import module for Categorical
from pandas.api.types import CategoricalDtype

# split the column after the first space
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
df[["Brand", "Model"]] = df.Name.str.split(" ", 1, expand=True)

# convert to categorical
df["Brand"] = pd.Categorical(df.Brand)
df["Model"] = pd.Categorical(df.Model)
df[["Brand", "Model"]].dtypes
Out[99]:
Brand    category
Model    category
dtype: object
In [100]:
# value counts
df["Brand"].value_counts()
Out[100]:
Maruti           1444
Hyundai          1340
Honda             741
Toyota            506
Mercedes-Benz     380
Volkswagen        374
Ford              351
Mahindra          330
BMW               311
Audi              284
Tata              228
Skoda             202
Renault           170
Chevrolet         151
Nissan            117
Land               66
Jaguar             48
Fiat               38
Mitsubishi         36
Mini               31
Volvo              28
Jeep               19
Porsche            19
Datsun             17
ISUZU               3
Force               3
Isuzu               2
Bentley             2
Ambassador          1
Smart               1
OpelCorsa           1
Hindustan           1
Lamborghini         1
Name: Brand, dtype: int64
In [101]:
# plot
plt.figure(figsize=(15, 7))
sns.countplot(y="Brand", data=df, order=df["Brand"].value_counts().index)
Out[101]:
<AxesSubplot:xlabel='count', ylabel='Brand'>

After separating out the Name I can safely drop it.

In [102]:
# drop a column
df.drop("Name", inplace=True, axis=1)

Let's review our new columns

Brand.¶

Let's check out our Brand by Price to find breakdowns of car classes. Here we look at mean price by brand.

In [103]:
# find unique count
df.Brand.nunique()
Out[103]:
33

Correct Model for first name.

In [104]:
df["Model"] = df.Model.str.split(" ", 1, expand=True)
In [105]:
# find unique count
df.Model.nunique()
Out[105]:
217
In [106]:
# Check the data
df["Model"].value_counts()
Out[106]:
Swift         418
City          316
i20           303
Innova        203
Verna         200
             ... 
Countryman      1
Beetle          1
CLS-Class       1
Boxster         1
1000            1
Name: Model, Length: 217, dtype: int64
In [107]:
plt.figure(figsize=(15, 7))
sns.countplot(y="Model", data=df, order=df["Model"].value_counts().index[:30])
Out[107]:
<AxesSubplot:xlabel='count', ylabel='Model'>

Since there are so many models lets see if we can reduce the load on the visualizations and modeling to come up with some more generalized accounts of different models.

217 unique values is better than over 2000 but is still alot. For the sake of a better model, let's drop this level of specificity and focus on Brand. We will create categories based on price.

In [108]:
# View mean of one variable by another
df.groupby(["Brand"])["Price"].mean().sort_values(ascending=False)
Out[108]:
Brand
Lamborghini     120.00
Porsche          46.10
Land             33.86
Bentley          32.32
Jaguar           32.30
Mini             23.47
Mercedes-Benz    23.36
BMW              22.28
Audi             22.13
Jeep             15.97
Volvo            15.51
Isuzu            12.82
Toyota           10.45
ISUZU             9.91
Mitsubishi        9.70
Force             9.33
Mahindra          7.60
Skoda             7.28
Ford              6.71
Renault           5.78
OpelCorsa         5.64
Hindustan         5.64
Honda             5.45
Hyundai           5.40
Volkswagen        5.36
Nissan            4.94
Maruti            4.70
Tata              3.95
Fiat              3.89
Datsun            3.66
Chevrolet         3.56
Smart             3.00
Ambassador        1.35
Name: Price, dtype: float64

Observations

  1. Notice that "ISUZU" and "Isuzu" are both found so let's title-case the series.
  2. We can see our outliers makes: Lambourghini and Ambassador - lets check them.
  3. We have some general price breaks that we might use.
In [109]:
# make lower case
df.Brand = df.Brand.str.title()
In [110]:
# View mean of one variable by another
df.Brand.unique()
Out[110]:
array(['Maruti', 'Hyundai', 'Honda', 'Audi', 'Nissan', 'Toyota',
       'Volkswagen', 'Tata', 'Land', 'Mitsubishi', 'Renault',
       'Mercedes-Benz', 'Bmw', 'Mahindra', 'Ford', 'Porsche', 'Datsun',
       'Jaguar', 'Volvo', 'Chevrolet', 'Skoda', 'Mini', 'Fiat', 'Jeep',
       'Smart', 'Ambassador', 'Isuzu', 'Force', 'Bentley', 'Lamborghini',
       'Hindustan', 'Opelcorsa'], dtype=object)
In [111]:
# get lowest and highest car values
df.loc[(df.Brand == "Ambassador") | (df.Brand == "Lamborghini")]
Out[111]:
Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log Brand Model
S.No.
1221 Chennai 2003 80000 Diesel Manual Third 35.50 5 1.35 1489 12.80 0.30 Ambassador Classic
5781 Delhi 2011 6500 Petrol Automatic Third 560.00 2 120.00 5204 6.40 4.79 Lamborghini Gallardo

Binning Brand¶

We will bucket the Brands or Makes into Economic Types.

Let's bin by generalized appearance of breakdown from the Price column.

In [112]:
df["Car_category"] = pd.cut(df.Price, bins=[0, 15, 30, 50, 200])
df.head()
Out[112]:
Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log Brand Model Car_category
S.No.
0 Mumbai 2010 72000 CNG Manual First 58.16 5 1.75 998 26.60 0.56 Maruti Wagon (0, 15]
1 Pune 2015 41000 Diesel Manual First 126.20 5 12.50 1582 19.67 2.53 Hyundai Creta (0, 15]
2 Chennai 2011 46000 Petrol Manual First 88.70 5 4.50 1199 18.20 1.50 Honda Jazz (0, 15]
3 Chennai 2012 87000 Diesel Manual First 88.76 7 6.00 1248 20.77 1.79 Maruti Ertiga (0, 15]
4 Coimbatore 2013 40670 Diesel Automatic Second 140.80 5 17.74 1968 15.20 2.88 Audi A4 (15, 30]
In [113]:
# value counts
df["Car_category"].value_counts().sort_index()
Out[113]:
(0, 15]      6265
(15, 30]      634
(30, 50]      259
(50, 200]      88
Name: Car_category, dtype: int64

It's clear that the majority of the used cars are between 0 - 10 Lakh's. Let's label the bins by some generic car terminology.

In [114]:
# drop column
df.drop(columns="Car_category", inplace=True)
In [115]:
# make cut with labels
df["Car_category"] = pd.cut(
    df.Price,
    bins=[0, 15, 30, 50, 100],
    labels=["Budget_Friendly", "Mid_Range", "Luxury_Cars", "Ultra_luxury"],
)
df.head()
Out[115]:
Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power Seats Price Engine_num Mileage_num price_log Brand Model Car_category
S.No.
0 Mumbai 2010 72000 CNG Manual First 58.16 5 1.75 998 26.60 0.56 Maruti Wagon Budget_Friendly
1 Pune 2015 41000 Diesel Manual First 126.20 5 12.50 1582 19.67 2.53 Hyundai Creta Budget_Friendly
2 Chennai 2011 46000 Petrol Manual First 88.70 5 4.50 1199 18.20 1.50 Honda Jazz Budget_Friendly
3 Chennai 2012 87000 Diesel Manual First 88.76 7 6.00 1248 20.77 1.79 Maruti Ertiga Budget_Friendly
4 Coimbatore 2013 40670 Diesel Automatic Second 140.80 5 17.74 1968 15.20 2.88 Audi A4 Mid_Range

I will need to one-hot-encode these categories for modeling. For now lets make sure they are categorical types.

In [116]:
cat_vars = ["Car_category", "Model", "Brand"]

for colname in cat_vars:
    df[colname] = df[colname].astype("category")

df.info()  # Explore dataframe information, check dtype, and nulls.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7246 entries, 0 to 7252
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Location           7246 non-null   category
 1   Year               7246 non-null   int64   
 2   Kilometers_Driven  7246 non-null   int64   
 3   Fuel_Type          7246 non-null   category
 4   Transmission       7246 non-null   category
 5   Owner_Type         7246 non-null   category
 6   Power              7246 non-null   float64 
 7   Seats              7246 non-null   int64   
 8   Price              7246 non-null   float64 
 9   Engine_num         7246 non-null   int64   
 10  Mileage_num        7246 non-null   float64 
 11  price_log          7246 non-null   float64 
 12  Brand              7246 non-null   category
 13  Model              7246 non-null   category
 14  Car_category       7245 non-null   category
dtypes: category(7), float64(4), int64(4)
memory usage: 836.3 KB
In [117]:
# get value counts
df.Car_category.value_counts()
Out[117]:
Budget_Friendly    6265
Mid_Range           634
Luxury_Cars         259
Ultra_luxury         87
Name: Car_category, dtype: int64

Exploratory Data Analysis¶

Summary statistics¶

In [118]:
# 5 (Quantile) Statistics for numerical variables
df.describe()
Out[118]:
Year Kilometers_Driven Power Seats Price Engine_num Mileage_num price_log
count 7246.00 7246.00 7246.00 7246.00 7246.00 7246.00 7246.00 7246.00
mean 2013.37 57810.70 112.30 5.28 8.80 1615.41 18.35 1.81
std 3.25 37504.19 52.88 0.81 10.12 592.75 4.13 0.80
min 1996.00 171.00 34.20 2.00 0.44 624.00 6.40 -0.82
25% 2011.00 34000.00 77.00 5.00 3.85 1198.00 15.40 1.35
50% 2014.00 53469.50 94.34 5.00 5.64 1493.00 18.20 1.73
75% 2016.00 73000.00 138.03 5.00 8.40 1968.00 21.10 2.13
max 2019.00 775000.00 616.00 10.00 120.00 5998.00 33.54 4.79
In [119]:
# Basic statistics for Categorical variables
df.describe(include=["category"])
Out[119]:
Location Fuel_Type Transmission Owner_Type Brand Model Car_category
count 7246 7246 7246 7246 7246 7246 7245
unique 11 4 2 4 32 217 4
top Mumbai Diesel Manual First Maruti Swift Budget_Friendly
freq 946 3850 5202 5945 1444 418 6265

Set-up for visualization¶

In [120]:
# Function from GreatLearning on %'s for bar graphs
def per_on_bar(plot, feature):
    total = len(feature)
    for p in ax.patches:
        perc = "{:1.f}%".format(100 * p.get_height() / total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + g.get_height()
        ax.annotate(perc, (x, y), size=12)
    plt.show()
In [121]:
categoricals = df.select_dtypes(include=["object", "category", "string"])
global categoricals
[categoricals]
Out[121]:
[         Location Fuel_Type Transmission Owner_Type          Brand    Model  \
 S.No.                                                                         
 0          Mumbai       CNG       Manual      First         Maruti    Wagon   
 1            Pune    Diesel       Manual      First        Hyundai    Creta   
 2         Chennai    Petrol       Manual      First          Honda     Jazz   
 3         Chennai    Diesel       Manual      First         Maruti   Ertiga   
 4      Coimbatore    Diesel    Automatic     Second           Audi       A4   
 ...           ...       ...          ...        ...            ...      ...   
 7248    Hyderabad    Diesel       Manual      First     Volkswagen    Vento   
 7249       Mumbai    Petrol    Automatic      First     Volkswagen     Polo   
 7250      Kolkata    Diesel       Manual      First         Nissan    Micra   
 7251         Pune    Petrol    Automatic      Third     Volkswagen     Polo   
 7252        Kochi    Diesel    Automatic      First  Mercedes-Benz  E-Class   
 
           Car_category  
 S.No.                   
 0      Budget_Friendly  
 1      Budget_Friendly  
 2      Budget_Friendly  
 3      Budget_Friendly  
 4            Mid_Range  
 ...                ...  
 7248   Budget_Friendly  
 7249   Budget_Friendly  
 7250   Budget_Friendly  
 7251   Budget_Friendly  
 7252   Budget_Friendly  
 
 [7246 rows x 7 columns]]

Univariate¶

In [122]:
# Function to perform statistical analysis on every numeric and categorical variable in the dataframe
pd.set_option("display.max_colwidth", 5000)


def univar_num(dataframe):  # argument is the whole dataframe

    # Print statistics for numeric variables
    print("\n Mode Analysis \n", df.mode("index")[0:2])  # Mode (or most popular)
    print("\n Sum Analysis \n", numerical_features.sum())  # Sum analysis
    print("\n Variance analysis \n", numerical_features.var())  # Variance analysis
    print(
        "\n Absolute Deviation or Mean Absolute Deviation \n", numerical_features.mad()
    )  # Absolute Deviation or Mean Absolute Deviation
    print("\n  Skew analysis \n ", numerical_features.skew())  # Skew analysis

    return
In [123]:
# Run dataframe through function
univar_num(numerical_features)
 Mode Analysis 
   Location  Year  Kilometers_Driven Fuel_Type Transmission Owner_Type  Power  \
0   Mumbai  2015              60000    Diesel       Manual      First  74.00   

   Seats  Price  Engine_num  Mileage_num  price_log   Brand  Model  \
0      5   5.64        1197        17.00       1.73  Maruti  Swift   

      Car_category  
0  Budget_Friendly  

 Sum Analysis 
 Year                 14594892.00
Kilometers_Driven   425546305.00
Power                  814311.69
Seats                   38259.00
Price                   63979.65
Engine_num           11714459.00
Mileage_num            132984.80
price_log               13111.35
dtype: float64

 Variance analysis 
 Year                        10.59
Kilometers_Driven   7131921139.95
Power                     2801.12
Seats                        0.65
Price                      106.00
Engine_num              352069.92
Mileage_num                 17.10
price_log                    0.64
dtype: float64

 Absolute Deviation or Mean Absolute Deviation 
 Year                    2.58
Kilometers_Driven   26624.55
Power                  39.54
Seats                   0.53
Price                   6.22
Engine_num            460.64
Mileage_num             3.31
price_log               0.57
dtype: float64

  Skew analysis 
  Year                -0.84
Kilometers_Driven   61.57
Power                2.00
Seats                1.91
Price                3.73
Engine_num           1.42
Mileage_num          0.21
price_log            0.52
dtype: float64
In [124]:
# Frequency tables for every feature
for (
    column_name,
    column_data,
) in (
    df.iteritems()
):  # for every column in the dataframe return the column name and Series of data
    display((column_name, column_data.value_counts()))
('Location',
 Mumbai        946
 Hyderabad     874
 Coimbatore    772
 Kochi         772
 Pune          765
 Delhi         660
 Kolkata       654
 Chennai       589
 Jaipur        499
 Bangalore     440
 Ahmedabad     275
 Name: Location, dtype: int64)
('Year',
 2015    929
 2014    925
 2016    885
 2013    791
 2017    707
 2012    689
 2011    578
 2010    405
 2018    361
 2009    252
 2008    207
 2007    148
 2019    119
 2006     89
 2005     68
 2004     35
 2003     20
 2002     18
 2001      8
 2000      5
 1998      4
 1999      2
 1996      1
 Name: Year, dtype: int64)
('Kilometers_Driven',
 60000    96
 45000    86
 65000    86
 70000    77
 50000    71
          ..
 24350     1
 58217     1
 60268     1
 21363     1
 64917     1
 Name: Kilometers_Driven, Length: 3659, dtype: int64)
('Fuel_Type',
 Diesel    3850
 Petrol    3322
 CNG         62
 LPG         12
 Name: Fuel_Type, dtype: int64)
('Transmission',
 Manual       5202
 Automatic    2044
 Name: Transmission, dtype: int64)
('Owner_Type',
 First             5945
 Second            1152
 Third              137
 Fourth & Above      12
 Name: Owner_Type, dtype: int64)
('Power',
 74.00     280
 94.34     174
 98.60     166
 73.90     152
 140.00    142
          ... 
 328.50      1
 174.57      1
 560.00      1
 231.10      1
 83.11       1
 Name: Power, Length: 383, dtype: int64)
('Seats',
 5     6094
 7      796
 8      170
 4      119
 6       38
 2       18
 10       8
 9        3
 Name: Seats, dtype: int64)
('Price',
 5.64     1238
 4.50       88
 5.50       84
 3.50       82
 4.25       73
          ... 
 14.29       1
 21.26       1
 55.70       1
 43.74       1
 2.22        1
 Name: Price, Length: 1372, dtype: int64)
('Engine_num',
 1197    732
 1248    610
 1498    370
 998     309
 1198    281
 2179    278
 1497    271
 1968    266
 1995    212
 1461    188
 2143    184
 1199    177
 1582    174
 1396    170
 1598    164
 796     164
 2494    154
 1086    129
 1591    113
 1399    106
 2993    104
 2982    103
 1493    103
 1798     99
 2987     78
 814      74
 2967     67
 1120     60
 1373     59
 1364     58
 1196     57
 2354     53
 1298     48
 2755     42
 1991     41
 799      40
 1896     39
 1799     39
 999      36
 1998     35
 1496     33
 1061     32
 624      31
 1796     29
 2393     28
 1586     26
 2498     26
 1499     25
 936      25
 1405     25
 1794     24
 1193     24
 1984     22
 2523     22
 1997     22
 1999     21
 2148     21
 1495     20
 1186     18
 2477     17
 2199     17
 1956     17
 1368     17
 2499     16
 995      15
 3198     15
 2497     14
 993      14
 2696     14
 3498     13
 2400     13
 1595     13
 1599     13
 2835     12
 1341     11
 1388     11
 1299     11
 2953     10
 1950      9
 2198      8
 1150      8
 4134      8
 2489      8
 1596      7
 2996      7
 2979      7
 1462      7
 1172      6
 1590      6
 2698      6
 2496      6
 1047      5
 2446      5
 793       5
 4367      5
 2359      5
 2609      5
 1390      4
 4395      4
 4806      4
 2362      4
 1343      4
 1797      4
 1194      3
 2349      3
 2956      3
 5461      3
 1395      3
 1969      3
 1948      2
 5998      2
 3597      2
 2360      2
 2894      2
 2200      2
 2997      2
 2495      2
 2771      2
 1242      2
 2487      2
 5000      2
 1985      2
 2999      2
 1781      2
 1597      2
 2147      2
 3436      2
 1468      1
 1978      1
 4951      1
 2149      1
 1795      1
 2694      1
 2773      1
 2925      1
 2995      1
 3200      1
 1389      1
 2112      1
 5204      1
 2706      1
 3696      1
 2720      1
 1489      1
 1422      1
 2000      1
 2092      1
 970       1
 Name: Engine_num, dtype: int64)
('Mileage_num',
 17.00    206
 18.90    201
 18.60    144
 18.20    115
 21.10    107
         ... 
 16.12      1
 12.97      1
 22.80      1
 11.62      1
 7.50       1
 Name: Mileage_num, Length: 437, dtype: int64)
('price_log',
 1.73    1238
 1.50      88
 1.70      84
 1.25      82
 1.45      73
         ... 
 0.63       1
 1.43       1
 4.17       1
 1.38       1
 4.53       1
 Name: price_log, Length: 1372, dtype: int64)
('Brand',
 Maruti           1444
 Hyundai          1340
 Honda             741
 Toyota            506
 Mercedes-Benz     380
 Volkswagen        374
 Ford              351
 Mahindra          330
 Bmw               311
 Audi              284
 Tata              228
 Skoda             202
 Renault           170
 Chevrolet         151
 Nissan            117
 Land               66
 Jaguar             48
 Fiat               38
 Mitsubishi         36
 Mini               31
 Volvo              28
 Jeep               19
 Porsche            19
 Datsun             17
 Isuzu               5
 Force               3
 Bentley             2
 Lamborghini         1
 Opelcorsa           1
 Hindustan           1
 Smart               1
 Ambassador          1
 Name: Brand, dtype: int64)
('Model',
 Swift       418
 City        316
 i20         303
 Innova      203
 Verna       200
            ... 
 Fusion        1
 Venture       1
 Gallardo      1
 Land          1
 Motors        1
 Name: Model, Length: 217, dtype: int64)
('Car_category',
 Budget_Friendly    6265
 Mid_Range           634
 Luxury_Cars         259
 Ultra_luxury         87
 Name: Car_category, dtype: int64)

Observations:

  1. Top Brand is Maruti and Model is Swift.

  2. The top Year for Used Cars is 2015.

  3. Most Owners are first time for these cars.

Visualizations¶

In [125]:
# Function to create distribution and statistical plots for each numerical feature in the dataframe
def univar_vis(dataframe):

    sns.set(style="ticks")
    sns.set_style("darkgrid")

    # Univariate visualization of categoricals variables
    for (column_name_c, column_data_c) in categoricals.iteritems():

        print(column_name_c, "Count Plot")
        plt.figure(figsize=(20, 10))
        sns.countplot(x=column_data_c)  # Bar plot against a numeric variable
        plt.show()

        """
        print(column_name_c, "Distribution Plot")
        plt.figure(figsize=(20, 10))
        sns.displot(data=column_data_c)  # distribution plot
        plt.show()
        """

    # Univariate visualization of numeric variables
    for (column_name, column_data) in numerical_features.iteritems():

        print(column_name, "Histogram")
        sns.histplot(df[column_name], kde=True)  # histogram
        plt.figure(figsize=(20, 10))
        plt.show()

        """
        print(column_name, "Distribution Plot")
        plt.figure(figsize=(20, 10))
        sns.displot(data=column_data)  # distribution plot
        plt.show()
        """

        print(column_name, "Density Plot")
        sns.kdeplot(
            data=column_data, x=numerical_features[column_name]
        )  # Kernel distribution
        plt.figure(figsize=(20, 10))
        plt.show()

        print(column_name, "Box Plot")
        sns.boxplot(data=column_data, y=df[column_name])  # Box plot
        plt.figure(figsize=(20, 10))
        plt.show()

        # raise Exception("End to kill infinite loop")

    return
In [126]:
# Run dataframe through function
univar_vis(df)
Location Count Plot
Fuel_Type Count Plot
Transmission Count Plot
Owner_Type Count Plot
Brand Count Plot
Model Count Plot
Car_category Count Plot
Year Histogram
<Figure size 1440x720 with 0 Axes>
Year Density Plot
<Figure size 1440x720 with 0 Axes>
Year Box Plot
<Figure size 1440x720 with 0 Axes>
Kilometers_Driven Histogram
<Figure size 1440x720 with 0 Axes>
Kilometers_Driven Density Plot
<Figure size 1440x720 with 0 Axes>
Kilometers_Driven Box Plot
<Figure size 1440x720 with 0 Axes>
Power Histogram
<Figure size 1440x720 with 0 Axes>
Power Density Plot
<Figure size 1440x720 with 0 Axes>
Power Box Plot
<Figure size 1440x720 with 0 Axes>
Seats Histogram
<Figure size 1440x720 with 0 Axes>
Seats Density Plot
<Figure size 1440x720 with 0 Axes>
Seats Box Plot
<Figure size 1440x720 with 0 Axes>
Price Histogram
<Figure size 1440x720 with 0 Axes>
Price Density Plot
<Figure size 1440x720 with 0 Axes>
Price Box Plot
<Figure size 1440x720 with 0 Axes>
Engine_num Histogram
<Figure size 1440x720 with 0 Axes>
Engine_num Density Plot
<Figure size 1440x720 with 0 Axes>
Engine_num Box Plot
<Figure size 1440x720 with 0 Axes>
Mileage_num Histogram
<Figure size 1440x720 with 0 Axes>
Mileage_num Density Plot
<Figure size 1440x720 with 0 Axes>
Mileage_num Box Plot
<Figure size 1440x720 with 0 Axes>
price_log Histogram
<Figure size 1440x720 with 0 Axes>
price_log Density Plot
<Figure size 1440x720 with 0 Axes>
price_log Box Plot
<Figure size 1440x720 with 0 Axes>
In [127]:
# From DSBA_SLF_MLS1_LifeExpectancyCaseStudy
all_col = df.select_dtypes(include=["float64", "int64"]).columns.tolist()

plt.figure(figsize=(17, 75))

for i in range(len(all_col)):
    plt.subplot(18, 3, i + 1)
    plt.hist(df[all_col[i]])
    sns.histplot(
        df[all_col[i]], kde=True
    )  # you can comment the previous line and run this one to get distribution curves
    plt.tight_layout()
    plt.title(all_col[i], fontsize=25)

plt.show()
In [128]:
### Code from Mounika of GreatLearning as given in the Uber Case study ###

# While doing uni-variate analysis of numerical variables we want to study their central tendency
# and dispersion.
# Let us write a function that will help us create boxplot and histogram for any input numerical
# variable.
# This function takes the numerical column as the input and returns the boxplots
# and histograms for the variable.
# Let us see if this help us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    """Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (9,8))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        feature, ax=ax_box2, showmeans=True, color="red"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.distplot(feature, kde=F, ax=ax_hist2, bins=bins) if bins else sns.distplot(
        feature, kde=False, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        np.mean(feature), color="g", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        np.median(feature), color="black", linestyle="-"
    )  # Add median to the histogram
In [129]:
display(numerical_features.apply(lambda x: histogram_boxplot(x)))
Year                 None
Kilometers_Driven    None
Power                None
Seats                None
Price                None
Engine_num           None
Mileage_num          None
price_log            None
dtype: object

Bivariate¶

In [130]:
# Covariance analysis
df.cov()
Out[130]:
Year Kilometers_Driven Power Seats Price Engine_num Mileage_num price_log
Year 10.59 -54876.52 4.16 0.03 9.15 -102.53 3.98 1.20
Kilometers_Driven -54876.52 1406564098.90 -2580.42 6435.57 -58965.28 3509610.72 -22085.82 -5261.39
Power 4.16 -2580.42 2796.50 4.18 379.00 26799.62 -118.73 29.28
Seats 0.03 6435.57 4.18 0.65 0.41 193.12 -1.16 0.10
Price 9.15 -58965.28 379.00 0.41 102.44 3637.38 -12.61 6.87
Engine_num -102.53 3509610.72 26799.62 193.12 3637.38 351346.92 -1564.60 297.06
Mileage_num 3.98 -22085.82 -118.73 -1.16 -12.61 -1564.60 17.09 -0.90
price_log 1.20 -5261.39 29.28 0.10 6.87 297.06 -0.90 0.63
In [131]:
# Correlation Analysis
corr = df.corr()
display(corr)
Year Kilometers_Driven Power Seats Price Engine_num Mileage_num price_log
Year 1.00 -0.45 0.02 0.01 0.28 -0.05 0.30 0.46
Kilometers_Driven -0.45 1.00 -0.00 0.21 -0.16 0.16 -0.14 -0.18
Power 0.02 -0.00 1.00 0.10 0.71 0.85 -0.54 0.70
Seats 0.01 0.21 0.10 1.00 0.05 0.40 -0.35 0.15
Price 0.28 -0.16 0.71 0.05 1.00 0.61 -0.30 0.85
Engine_num -0.05 0.16 0.85 0.40 0.61 1.00 -0.64 0.63
Mileage_num 0.30 -0.14 -0.54 -0.35 -0.30 -0.64 1.00 -0.27
price_log 0.46 -0.18 0.70 0.15 0.85 0.63 -0.27 1.00

Looking for anything close to one, anything above .8 is of interest.

In [132]:
# Plot the heatmap of correlations
plt.figure(figsize=(16, 12))
display("")
sns.heatmap(
    corr,
    annot=True,
    cmap="coolwarm",
    fmt=".1f",
    ### Categorical variables   xticklabels=corr.columns,
    yticklabels=corr.columns,
)
''
Out[132]:
<AxesSubplot:>

Observations:

  1. Power and Engine are highly positively correlated. One of them will be dropped for the Linear Reg. model.

  2. Power and Price are the next in line.

  3. Price and Engine & Mileage and Engine are 3rd at .6 (-.06) degree of correlation strength. The rest of data is not very correlated.

EDA Report¶

In [133]:
# write Pandas Profile report
profile = ProfileReport(df, title="Pandas Profiling Report")
In [134]:
# create iframe version
profile.to_notebook_iframe()

Feature of Interest: Price¶

Identify the key variables that have a strong relationship with dependent variable

Note The independent variables are operating on a log-scale at this point.

Cross the top value from Fuel_Type and Price to get a best guess of linear relationship¶
In [135]:
# create a lineplot
plt.figure(figsize=(20, 10))
sns.lineplot(y="Price", x="Transmission", data=df, ci=None)
Out[135]:
<AxesSubplot:xlabel='Transmission', ylabel='Price'>
Cross Engine and Price to get a best guess of linear relationship¶
In [136]:
# create a lineplot
plt.figure(figsize=(20, 15))
sns.lineplot(y="Price", x="Engine_num", data=df, ci=None)
Out[136]:
<AxesSubplot:xlabel='Engine_num', ylabel='Price'>
Cross Power and Price to get a best guess of linear relationship¶
In [137]:
# create a lineplot
plt.figure(figsize=(20, 15))
sns.lineplot(y="Price", x="Power", data=df, ci=None)
Out[137]:
<AxesSubplot:xlabel='Power', ylabel='Price'>

Temporal variables are specificly used based on the case and understanding how and when to apply them is due to time and lots of practice with them. It's always up to the DS to decide which vars to use and not. It's a subjective process.

In [138]:
# average Price over the years
plt.figure(figsize=(15, 7))
sns.lineplot(x="Year", y="Price", data=df, ci=None)
Out[138]:
<AxesSubplot:xlabel='Year', ylabel='Price'>

Observations

  • Transmission has a negative linear relationship with Price when moving from Automatic to Manual.
  • Engine & Power appear to have positive linear relationships with Price.
  • Price seems to increase over time.
In [139]:
# barplot
sns.catplot(
    data=df, x=df["Location"], y=df["Price"], kind="box", orient="v", height=5, aspect=2
)
Out[139]:
<seaborn.axisgrid.FacetGrid at 0x20ed6bea3a0>

Here are the correlations for looking into multicolinearity

In [140]:
# Get correlation of each variable against price
df[df.columns[:]].corr().sort_values("Price", ascending=False)["Price"][:]
Out[140]:
Price                1.00
price_log            0.85
Power                0.71
Engine_num           0.61
Year                 0.28
Seats                0.05
Kilometers_Driven   -0.16
Mileage_num         -0.30
Name: Price, dtype: float64
In [141]:
# GreatLearning code
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
numeric_columns.remove("Year")  # dropping year column as it is temporal variable
corr = (
    df[numeric_columns].corr().sort_values(by=["Price"], ascending=False)
)  # sorting correlations w.r.t Price

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(28, 15))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(
    corr,
    mask=np.triu(
        corr
    ),  # write a mask to exlucde anything that is less than 0.01 but not easy to work with
    cmap="seismic",
    annot=True,
    fmt=".1f",
    vmin=-1,
    vmax=1,
    center=0,
    square=False,
    linewidths=0.7,
    cbar_kws={"shrink": 0.5},
)
Out[141]:
<AxesSubplot:>

Observations

  • Same correlations exist as previously stated between Engine and Power which may be why they are at the top.
  • Because are closely related I will drop one of them to reduce Multicolinearity.
  • No correlations are above .8 so I am not too worried about Multicorlinearity. I will drop later will needed.

Price as the dependent variable against every numeric variable colored for categoricals.

In [142]:
# Cross the numeric variables with Price and see the spread across the other categorical variables
for (column_name_c, column_data_c) in numerical_features.items():
    for (column_name, column_data) in categoricals.iteritems():
        sns.histplot(
            x=column_data_c,
            y=df.Price,
            data=df,
            hue=column_name,
            multiple="stack",
            shrink=0.8,
        )
        # plt.figure(figsize=(20,20))
        plt.show()

Inferences¶

Key meaningful observations on the relationship between variables.

S.No. :

  • Serial Number was added as an index column.

Name :

  • Was changed into Make and Model to account for wrangling issues in visualization and will be useful to enconde for modeling.

Location :

  • Mumbai is the most popular city with over 800 records.
  • Will need to be encoded for modeling and has a reasonable amount of categoricals to do so.
  • Ahmedabad has the least records with 275 a spread of 525 records from Mumbai.

Year :

  • 2014 - 2016 are the most popular years for used cars that are for sale.
  • Skew to the left with most years being above 2010.
  • Older cars are not represented at all. Anything before 1995 doesn't make it into the dataset.

Kilometers_driven :

  • The average kilometers driven in the car by the previous owner(s) in KM is 58699.06KM.
  • Is heavily skewed to the right with a variance of 7129993975.25. This may need treatment.
  • There is one vehicle at 6500000km driver which seems vastly unrealistic and will need to be treated.

Fuel_Type :

  • The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG) with the vast majority being Diesel or Petrol.
  • liquefied petroleum gas is included with a <1% repreentation.

Transmission :

  • The type of transmission used by the cars are mostly Manual.
  • The automatic types only account for ~28% of all cars in the dataset.
  • These two types should be easy to build into the model after encoding.

Owner :

  • Type of ownership is dominated by First owners.
  • There are nearly no cars that have passed more than 4 owners hands.
  • These 4 values should be easily handled with dummy variables for model building.

Mileage :

  • The standard mileage offered by the car company ranges from 0 to 33.54.
  • The distribution is almost normal with a few outliers above 30mpL, this should make modeling easier.
  • A light negative correlation with Power.

Engine :

  • The mean and median are fairly close at only 200 CC away from eachother. 50% lies above and below 1493.00 CC.
  • Is strongly negatively correlated with Power which is to be expected.
  • A slight negative correlation with Mileage.

Power :

  • The maximum power of the engine in bhp in this dataset is 616bhp.
  • Quite a few outliers which look to be outside the 1.5*IQR - will need to handle before modeling.
  • 75% of the data lies below 138bhp.

Seats :

  • 5 is the number of seats in the car that dominates the dataset at 84% of all values.
  • There is one car with 0 seats which may have been a typo.
  • Seats is almost normally distributed which is easier for the modeling.

New_Price :

  • The price of a new car of the same model in INR Lakhs.(1 Lakh = $100, 000 INR) is irrelevant as it's in a different market.
  • New Price has so many missing values (86%) that it was decided to drop the column for analysis.
  • As new car prices go up so do the sales of used cars and thus creating a shortage which should push up used car prices.

Price : The price of the used car in INR Lakhs (1 Lakh = $100, 000 INR)

  • Price has some correlation with Power.
  • The majority of the data is of used cars with 5.64Lakh value meaning over 500,000dollars which seams unrealistic but there are cars at the market price
  • The data is heavily right-skewed.

    Brand, Model and Car_category: has an obvious positive correlation with Price.

Summary¶

Our main insights were:

1. The `Price` estimate of central tendency (the mean (or other estimator) value) of used cars is much higher in Coimbatore and Bangalore than other locations.

2. `Price` for Petrol vehicles are many outside of the mean in the positive direction.

3.  Used cars with a Third owner tend to be before 2015 and have lower prices than other owners.

Model Building - Linear Regression¶

  1. Choose, train and evaluate an inferential model using Scikitlearn
  2. Build the inferential model and comment on the model statistics using Statsmodel

Dummy variables¶

Encoding Categoricals

Notes from GreatLearning:

"Many machine learning algorithms can support categorical values without further manipulation but there are many more algorithms that do not.

Sometimes we want to one-hot encode where every level gets an indicator, but other times we want to drop one level to account for the fact that in a model we will have an intercept."

This is worth doing at the very end because for many earlier analysis tasks (e.g. visualization and value counts) it will be more convenient to keep categorical variables as a single column. Additionally, we'll want to make sure missing values are resolved by this point.

In [143]:
ind_vars = df.select_dtypes(include=["object", "category"]).columns.tolist()
ind_vars
Out[143]:
['Location',
 'Fuel_Type',
 'Transmission',
 'Owner_Type',
 'Brand',
 'Model',
 'Car_category']

With 217 Models to sort through I'm going to not take the effort of encoding them all and go ahead and drop that for our model. We still have Brand and Bins to help us make sense of the specific cars.

In [144]:
df.drop("Model", axis=1, inplace=True)
In [145]:
# get the size of the current df vs the original
(len(df) / 7253) * 100
Out[145]:
99.90348821177444
In [146]:
# generate binary values using get_dummies dropping the 1st one
def encode_cat_vars(x):
    x = pd.get_dummies(
        x,
        columns=x.select_dtypes(include=["object", "category"]).columns.tolist(),
        drop_first=True,
    )
    return x
In [147]:
# apply function
ind_vars_num = encode_cat_vars(df)
ind_vars_num.head()
Out[147]:
Year Kilometers_Driven Power Seats Price Engine_num Mileage_num price_log Location_Bangalore Location_Chennai Location_Coimbatore Location_Delhi Location_Hyderabad Location_Jaipur Location_Kochi Location_Kolkata Location_Mumbai Location_Pune Fuel_Type_Diesel Fuel_Type_LPG Fuel_Type_Petrol Transmission_Manual Owner_Type_Fourth & Above Owner_Type_Second Owner_Type_Third Brand_Audi Brand_Bentley Brand_Bmw Brand_Chevrolet Brand_Datsun Brand_Fiat Brand_Force Brand_Ford Brand_Hindustan Brand_Honda Brand_Hyundai Brand_Isuzu Brand_Jaguar Brand_Jeep Brand_Lamborghini Brand_Land Brand_Mahindra Brand_Maruti Brand_Mercedes-Benz Brand_Mini Brand_Mitsubishi Brand_Nissan Brand_Opelcorsa Brand_Porsche Brand_Renault Brand_Skoda Brand_Smart Brand_Tata Brand_Toyota Brand_Volkswagen Brand_Volvo Car_category_Mid_Range Car_category_Luxury_Cars Car_category_Ultra_luxury
S.No.
0 2010 72000 58.16 5 1.75 998 26.60 0.56 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2015 41000 126.20 5 12.50 1582 19.67 2.53 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2011 46000 88.70 5 4.50 1199 18.20 1.50 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 2012 87000 88.76 7 6.00 1248 20.77 1.79 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 2013 40670 140.80 5 17.74 1968 15.20 2.88 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
In [148]:
# check the df shape
ind_vars_num.shape
Out[148]:
(7246, 59)
In [149]:
# get the size of the current df vs the original
(len(ind_vars_num) / 7253) * 100
Out[149]:
99.90348821177444
In [150]:
# assign an object by another object
df = ind_vars_num
In [151]:
# get the size of the current df vs the original
(len(df) / 7253) * 100
Out[151]:
99.90348821177444

We have ~60 columns to model with. We will probably cut this down considerably. We still have the majority of our data intact.

Split Data¶

Split the data into train and test (can split training into Train and Validation since you do not report on the Training but only on the Test set.

At this point I will drop Price to look at price_log against each dependent variable.

In [152]:
# drop a column
df.drop("Price", axis=1, inplace=True)
In [153]:
# lets build our linear model
# independent variable by dropping out the Price variable
X = df.drop(["price_log"], axis=1)
# the dependent variable = Price
y = df[["price_log"]]
In [154]:
display(y.head())
price_log
S.No.
0 0.56
1 2.53
2 1.50
3 1.79
4 2.88
In [155]:
print(X.shape)
print(y.shape)
(7246, 57)
(7246, 1)
In [156]:
# Split X and y into training and test set in 70:30 ratio
# 70/30 split + psuedo randomness using random-seed to make a sort of scientific sample
# training and test set sizes should be equal
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
In [157]:
# print the size of each x split
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in train data =", x_test.shape[0])
Number of rows in train data = 5072
Number of rows in train data = 2174
In [158]:
# check the head
x_train.head()
Out[158]:
Year Kilometers_Driven Power Seats Engine_num Mileage_num Location_Bangalore Location_Chennai Location_Coimbatore Location_Delhi Location_Hyderabad Location_Jaipur Location_Kochi Location_Kolkata Location_Mumbai Location_Pune Fuel_Type_Diesel Fuel_Type_LPG Fuel_Type_Petrol Transmission_Manual Owner_Type_Fourth & Above Owner_Type_Second Owner_Type_Third Brand_Audi Brand_Bentley Brand_Bmw Brand_Chevrolet Brand_Datsun Brand_Fiat Brand_Force Brand_Ford Brand_Hindustan Brand_Honda Brand_Hyundai Brand_Isuzu Brand_Jaguar Brand_Jeep Brand_Lamborghini Brand_Land Brand_Mahindra Brand_Maruti Brand_Mercedes-Benz Brand_Mini Brand_Mitsubishi Brand_Nissan Brand_Opelcorsa Brand_Porsche Brand_Renault Brand_Skoda Brand_Smart Brand_Tata Brand_Toyota Brand_Volkswagen Brand_Volvo Car_category_Mid_Range Car_category_Luxury_Cars Car_category_Ultra_luxury
S.No.
2080 2014 71000 126.32 5 1582 19.08 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
62 2015 58000 186.00 5 1796 11.74 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
6978 2008 131000 105.00 5 1896 15.00 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
3502 2015 54339 88.73 5 1396 22.54 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2156 2014 105044 100.60 7 2494 12.99 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

ScikitLearn Model¶

In [159]:
# is a best practice keep code minimal, can change the function that the object is holding but then the rest can run smoothly
# gives LinearRegression model
linearregression = LinearRegression()  # build a regression model from scikitlearn
# fitting the linear regression model on the train data (70% of the whole data)
linearregression.fit(x_train, y_train)
Out[159]:
LinearRegression()
In [160]:
# create the intercept
intercept = linearregression.intercept_[0]
print("The intercept for our model is {}".format(intercept))
The intercept for our model is -181.12333731112986

Coefficients & Intercept¶

In [161]:
# create coefficients and intercept on dataframe

coef_df = pd.DataFrame(
    np.append(linearregression.coef_[0], linearregression.intercept_[0]),
    index=x_train.columns.tolist() + ["Intercept"],
    columns=["Coefficients"],
)

coef_df
Out[161]:
Coefficients
Year 0.09
Kilometers_Driven -0.00
Power 0.00
Seats 0.01
Engine_num 0.00
Mileage_num -0.01
Location_Bangalore 0.07
Location_Chennai 0.00
Location_Coimbatore 0.05
Location_Delhi -0.05
Location_Hyderabad 0.06
Location_Jaipur -0.08
Location_Kochi -0.04
Location_Kolkata -0.14
Location_Mumbai 0.00
Location_Pune -0.04
Fuel_Type_Diesel 0.27
Fuel_Type_LPG 0.15
Fuel_Type_Petrol 0.07
Transmission_Manual -0.08
Owner_Type_Fourth & Above 0.10
Owner_Type_Second -0.05
Owner_Type_Third -0.15
Brand_Audi 0.41
Brand_Bentley 0.22
Brand_Bmw 0.49
Brand_Chevrolet 0.00
Brand_Datsun -0.03
Brand_Fiat 0.11
Brand_Force 0.59
Brand_Ford 0.19
Brand_Hindustan 0.00
Brand_Honda 0.34
Brand_Hyundai 0.27
Brand_Isuzu -0.20
Brand_Jaguar 0.33
Brand_Jeep 0.14
Brand_Lamborghini 0.00
Brand_Land 0.57
Brand_Mahindra 0.20
Brand_Maruti 0.26
Brand_Mercedes-Benz 0.44
Brand_Mini 0.60
Brand_Mitsubishi 0.31
Brand_Nissan 0.27
Brand_Opelcorsa 1.20
Brand_Porsche 0.26
Brand_Renault 0.27
Brand_Skoda 0.31
Brand_Smart 0.30
Brand_Tata -0.16
Brand_Toyota 0.35
Brand_Volkswagen 0.25
Brand_Volvo 0.24
Car_category_Mid_Range 0.70
Car_category_Luxury_Cars 0.98
Car_category_Ultra_luxury 1.19
Intercept -181.12

Statsmodels Model¶

We will now perform linear regression using statsmodels, a Python module that provides functions for the estimation of many statistical models, as well as for conducting statistical tests, and statistical data exploration.

Using statsmodels, we will be able to check the statistical validity of our model.

Model 1: Olsmodel, df , x_train¶

In [162]:
# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train = sm.add_constant(x_train)
# Add constant to test data
x_test = sm.add_constant(x_test)


def build_ols_model(train):
    # Create the model
    olsmodel = sm.OLS(y_train["price_log"], train)
    return olsmodel.fit()


olsmodel = build_ols_model(x_train)
print(olsmodel.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              price_log   R-squared:                       0.825
Model:                            OLS   Adj. R-squared:                  0.823
Method:                 Least Squares   F-statistic:                     429.1
Date:                Fri, 20 Aug 2021   Prob (F-statistic):               0.00
Time:                        06:55:51   Log-Likelihood:                -1622.4
No. Observations:                5072   AIC:                             3357.
Df Residuals:                    5016   BIC:                             3723.
Df Model:                          55                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                      -181.1233      4.129    -43.867      0.000    -189.218    -173.029
Year                          0.0905      0.002     44.027      0.000       0.086       0.094
Kilometers_Driven         -1.278e-07   1.56e-07     -0.819      0.413   -4.34e-07    1.78e-07
Power                         0.0022      0.000      8.408      0.000       0.002       0.003
Seats                         0.0079      0.009      0.846      0.397      -0.010       0.026
Engine_num                    0.0002    2.5e-05      6.985      0.000       0.000       0.000
Mileage_num                  -0.0142      0.002     -6.406      0.000      -0.019      -0.010
Location_Bangalore            0.0686      0.031      2.188      0.029       0.007       0.130
Location_Chennai              0.0016      0.029      0.055      0.956      -0.056       0.059
Location_Coimbatore           0.0521      0.029      1.822      0.068      -0.004       0.108
Location_Delhi               -0.0456      0.029     -1.580      0.114      -0.102       0.011
Location_Hyderabad            0.0641      0.028      2.303      0.021       0.010       0.119
Location_Jaipur              -0.0785      0.030     -2.586      0.010      -0.138      -0.019
Location_Kochi               -0.0412      0.029     -1.443      0.149      -0.097       0.015
Location_Kolkata             -0.1416      0.029     -4.873      0.000      -0.199      -0.085
Location_Mumbai               0.0017      0.028      0.060      0.952      -0.053       0.056
Location_Pune                -0.0396      0.028     -1.393      0.164      -0.095       0.016
Fuel_Type_Diesel              0.2668      0.051      5.269      0.000       0.168       0.366
Fuel_Type_LPG                 0.1452      0.137      1.061      0.289      -0.123       0.413
Fuel_Type_Petrol              0.0691      0.052      1.338      0.181      -0.032       0.170
Transmission_Manual          -0.0779      0.016     -4.910      0.000      -0.109      -0.047
Owner_Type_Fourth & Above     0.1026      0.113      0.909      0.363      -0.119       0.324
Owner_Type_Second            -0.0519      0.014     -3.709      0.000      -0.079      -0.024
Owner_Type_Third             -0.1546      0.035     -4.425      0.000      -0.223      -0.086
Brand_Audi                    0.4112      0.339      1.212      0.226      -0.254       1.076
Brand_Bentley                 0.2232      0.485      0.460      0.646      -0.728       1.175
Brand_Bmw                     0.4911      0.340      1.446      0.148      -0.175       1.157
Brand_Chevrolet               0.0012      0.339      0.004      0.997      -0.664       0.666
Brand_Datsun                 -0.0279      0.350     -0.080      0.936      -0.714       0.658
Brand_Fiat                    0.1097      0.344      0.319      0.750      -0.564       0.784
Brand_Force                   0.5887      0.413      1.426      0.154      -0.221       1.398
Brand_Ford                    0.1873      0.338      0.554      0.580      -0.476       0.850
Brand_Hindustan            3.501e-13    1.8e-13      1.948      0.051   -2.25e-15    7.03e-13
Brand_Honda                   0.3377      0.338      0.998      0.318      -0.326       1.001
Brand_Hyundai                 0.2732      0.338      0.808      0.419      -0.389       0.936
Brand_Isuzu                  -0.1953      0.378     -0.517      0.605      -0.935       0.545
Brand_Jaguar                  0.3332      0.344      0.967      0.333      -0.342       1.008
Brand_Jeep                    0.1390      0.350      0.397      0.691      -0.548       0.826
Brand_Lamborghini         -1.246e-13   6.59e-14     -1.891      0.059   -2.54e-13     4.6e-15
Brand_Land                    0.5720      0.342      1.675      0.094      -0.098       1.242
Brand_Mahindra                0.1977      0.339      0.584      0.559      -0.466       0.862
Brand_Maruti                  0.2595      0.338      0.768      0.443      -0.403       0.922
Brand_Mercedes-Benz           0.4446      0.339      1.312      0.190      -0.220       1.109
Brand_Mini                    0.5984      0.346      1.728      0.084      -0.080       1.277
Brand_Mitsubishi              0.3121      0.343      0.910      0.363      -0.360       0.984
Brand_Nissan                  0.2680      0.340      0.789      0.430      -0.398       0.934
Brand_Opelcorsa               1.1978      0.476      2.517      0.012       0.265       2.131
Brand_Porsche                 0.2640      0.351      0.751      0.453      -0.425       0.953
Brand_Renault                 0.2650      0.339      0.781      0.435      -0.400       0.930
Brand_Skoda                   0.3118      0.339      0.920      0.358      -0.353       0.976
Brand_Smart                   0.2993      0.477      0.627      0.530      -0.636       1.234
Brand_Tata                   -0.1567      0.339     -0.463      0.644      -0.821       0.507
Brand_Toyota                  0.3455      0.338      1.021      0.307      -0.318       1.009
Brand_Volkswagen              0.2517      0.338      0.744      0.457      -0.411       0.915
Brand_Volvo                   0.2360      0.347      0.680      0.497      -0.444       0.916
Car_category_Mid_Range        0.7006      0.022     32.076      0.000       0.658       0.743
Car_category_Luxury_Cars      0.9806      0.031     31.305      0.000       0.919       1.042
Car_category_Ultra_luxury     1.1911      0.051     23.558      0.000       1.092       1.290
==============================================================================
Omnibus:                      192.923   Durbin-Watson:                   2.012
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              554.732
Skew:                          -0.100   Prob(JB):                    3.48e-121
Kurtosis:                       4.608   Cond. No.                     1.20e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.71e-19. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Interpreting the Regression¶

Notes from GreatLearning:

  1. Adjusted. R-squared: It reflects the fit of the model.
    • Adjusted R-squared values generally range from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met.
    • In our case, the value for Adj. R-squared is 0.82, which is good.
  1. const coefficient: It is the Y-intercept.
    • It means that if all the dependent variable (features like Brand, Engine, Seats and so on) coefficients are zero, then the expected output (i.e., Y) would be equal to the const coefficient.
    • In our case, the value for const coefficient is -23.6193
  1. MOST IMPORATNT THING! Coefficient of an independent variable: It represents the change in the output Y due to a change in the independent variable (everything else held constant).
  • Negative values of the coefficient show that Price decreases with the increase of corresponding attribute value. These are: Seats, Kilometers_Driven, and others.

  • Positive values of the coefficient show that Price increases with the increase of corresponding attribute value. These are: Year,Mileage`, and others.

  1. std err: It reflects the level of accuracy of the coefficients.
    • The lower it is, the higher is the level of accuracy. We see that those cars with few records have lower scores.
  1. P>|t|: It is p-value. P is low, not = 0, higher significance.
  • p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.

    • For each independent feature, there is a null hypothesis and an alternate hypothesis. Here $\beta_i$ is the coefficient of the $i$th independent variable.

      • $H_o$ : Independent feature is not significant ($\beta_i = 0$)
      • $H_a$ : Independent feature is that it is significant ($\beta_i \neq 0$)
    • (P>|t|) gives the p-value for each independent feature to check that null hypothesis. We are considering 0.05 (5%) as significance level.

      • A p-value of less than 0.05 is considered to be statistically significant.
      • But these variables might contain multicollinearity, which will affect the p-values.
  1. Confidence Interval: It represents the range in which our coefficients are likely to fall (with a likelihood of 95%).

Observations:

  • Our adjusted R^2 is 0.82 which accounts for ~82% of the error variance, which is good.
  • There a number of high (above 0.05) p-values noticed.
  • The constant coefficient is low at ~-181 points for every change in the dependent variable.

We have build a linear regression model which shows multicolinearity and is highly accurate.

But this is not our final model. We have to check the statistical validity of the model, and also make sure it satisfies the assumptions of linear regression.

Inferential Model Performance¶

Evaluate the model on different performance metrics and comment on the performance and scope of improvement

Let's check the performance of the model using different metrics (MAE, MAPE, RMSE, $R^2$).

  • We will be using metric functions defined in sklearn for RMSE, MAE, and $R^2$.
  • We will define a function to calculate MAPE.
  • We will create a function which will print out all the above metrics in one go.
In [163]:
# defining function for MAPE from GreatLearning
import math

# RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())


# MAPE
def mape(predictions, targets):
    return np.mean(np.abs((targets - predictions)) / targets) * 100


# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))


# Model Performance on test and train data
def model_pref(olsmodel, x_train, x_test):

    # Insample Prediction
    y_pred_train_pricelog = olsmodel.predict(x_train)
    y_pred_train_Price = y_pred_train_pricelog
    y_train_Price = y_train["price_log"]

    # Prediction on test data
    y_pred_test_pricelog = olsmodel.predict(x_test)
    y_pred_test_Price = y_pred_test_pricelog
    y_test_Price = y_test["price_log"]

    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                "RMSE": [
                    rmse(y_pred_train_Price, y_train_Price),
                    rmse(y_pred_test_Price, y_test_Price),
                ],
                "MAE": [
                    mae(y_pred_train_Price, y_train_Price),
                    mae(y_pred_test_Price, y_test_Price),
                ],
                "MAPE": [
                    mape(y_pred_train_Price, y_train_Price),
                    mape(y_pred_test_Price, y_test_Price),
                ],
            }
        )
    )
  • The mean absolute percentage error (MAPE) measures the accuracy of predictions as a percentage, and can be calculated as the average absolute percent error for each predicted value minus actual values divided by actual values. It works best if there are no extreme values in the data and none of the actual values are 0.
  • The mean absolute error (MAE) is the simplest regression error metric to understand. We'll calculate the residual for every data point, taking only the absolute value of each so that negative and positive residuals do not cancel out. We then take the average of all these residuals. Effectively, MAE describes the typical magnitude of the residuals.
  • The root mean square error (RMSE) is just like the MAE, but squares the difference before summing them all instead of using the absolute value, and then takes the square root of the value.
  • $R^2$ (or coefficient of determination) represents the proportion of variance (of y) that has been explained by the independent variables in the model. It provides an indication of goodness of fit of the model. Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value (or mean) of y, disregarding the input features, would get a $R^2$ of 0.0.
In [164]:
# Checking model performance
model_pref(olsmodel, x_train, x_test)
    Data  RMSE  MAE  MAPE
0  Train  0.33 0.25   inf
1   Test  0.33 0.25   inf

Observations

  • Root Mean Squared Error of train and test data is slightly different, indicating that our model is overfitting the test data.

  • Mean Absolute Error indicates that our current model is able to predict used cars prices within mean error of .25(log-odds) lakhs on test data.

  • Mean Absolute Percentage Error is inf because it is dividing by 0 in some cases.

The units of both RMSE and MAE are same - Lakhs in this case. But RMSE is greater than MAE because it peanalises the outliers more.

Statistics: Linear Regression Assumption Tests¶

We need to deal with multicollinearity and check the other assumptions of linear regression first, and then look at the p-values.

Perform tests for the assumptions of the linear regression. We will be checking the following Linear Regression assumptions to prevent error in inference or insights:

1. No Multicollinearity (correlation is too tight between two variables and inflates the co-efficient)

2. Mean of residuals should be 0

3. No Heteroscedasticity

4. Linearity of variables

5. Normality of error terms (important for calculating co-efficients)

No Multicollinearity¶

  • Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the correlation between variables is high, it can cause problems when we fit the model and interpret the results. When we have multicollinearity the linear model, The coefficients that the model suggests are unreliable.
  • There are different ways of detecting (or testing) multicollinearity. One such way is by using Variation Inflation Factor.
  • Variance Inflation factor: Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by the existence of correlation among the predictor variables in the model.
  • General Rule of thumb: If VIF is 1 then there is no correlation among the $k$th predictor and the remaining predictor variables, and hence, the variance of $\beta_k$ is not inflated at all. Whereas, if VIF exceeds 5 or is close to exceeding 5, we say there is moderate multicollinearity, and if it is 10 or exceeding 10, it shows signs of high multicollinearity.
In [165]:
def checking_vif(train):
    vif = pd.DataFrame()
    vif["feature"] = train.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(train.values, i) for i in range(len(train.columns))
    ]
    return vif


# print(checking_vif(x_train))

Features having a VIF score >5 will be dropped/treated till all the features have a VIF score <5

In [166]:
# Check VIF
checking_vif(x_train)
Out[166]:
feature VIF
0 const 770317.64
1 Year 2.04
2 Kilometers_Driven 1.62
3 Power 8.78
4 Seats 2.53
5 Engine_num 10.05
6 Mileage_num 3.89
7 Location_Bangalore 2.42
8 Location_Chennai 3.03
9 Location_Coimbatore 3.48
10 Location_Delhi 3.10
11 Location_Hyderabad 3.76
12 Location_Jaipur 2.70
13 Location_Kochi 3.48
14 Location_Kolkata 3.10
15 Location_Mumbai 3.91
16 Location_Pune 3.52
17 Fuel_Type_Diesel 28.85
18 Fuel_Type_LPG 1.17
19 Fuel_Type_Petrol 29.89
20 Transmission_Manual 2.29
21 Owner_Type_Fourth & Above 1.02
22 Owner_Type_Second 1.19
23 Owner_Type_Third 1.14
24 Brand_Audi 187.56
25 Brand_Bentley 2.10
26 Brand_Bmw 219.86
27 Brand_Chevrolet 100.45
28 Brand_Datsun 15.23
29 Brand_Fiat 27.25
30 Brand_Force 3.04
31 Brand_Ford 238.45
32 Brand_Hindustan NaN
33 Brand_Honda 469.29
34 Brand_Hyundai 763.84
35 Brand_Isuzu 5.08
36 Brand_Jaguar 31.52
37 Brand_Jeep 15.26
38 Brand_Lamborghini NaN
39 Brand_Land 52.47
40 Brand_Mahindra 215.02
41 Brand_Maruti 841.10
42 Brand_Mercedes-Benz 254.28
43 Brand_Mini 22.33
44 Brand_Mitsubishi 32.25
45 Brand_Nissan 83.89
46 Brand_Opelcorsa 2.02
47 Brand_Porsche 16.46
48 Brand_Renault 112.33
49 Brand_Skoda 157.59
50 Brand_Smart 2.03
51 Brand_Tata 158.35
52 Brand_Toyota 337.59
53 Brand_Volkswagen 253.25
54 Brand_Volvo 20.31
55 Car_category_Mid_Range 1.66
56 Car_category_Luxury_Cars 1.57
57 Car_category_Ultra_luxury 1.42
  • Engine, others, and Power has a VIF score of much greater than 5. Clearly these 2 variables are correlated with each other.
  • Brand have lots of multicolinearity. This does seem to make intuitive sense because the number of CC's and bhp would have a significant overlap as size is related to horsepower since a large engine should in theory produce more power.

I will need to drop some columns to make this handle correctly.

Model 2: olsmodel1, df1,xtrain1¶

Handling large independent variables.¶

Since Brand is binned by economic categories that we created earlier lets drop them and create a lighter model to test.

In [167]:
# get a list of columns that do not have "Brand" in the title
cols = [c for c in df.columns if c.title()[:5] != "Brand"]
In [168]:
# create a new dataframe and read the head
df1 = df[cols]
df1.head()
Out[168]:
Year Kilometers_Driven Power Seats Engine_num Mileage_num price_log Location_Bangalore Location_Chennai Location_Coimbatore Location_Delhi Location_Hyderabad Location_Jaipur Location_Kochi Location_Kolkata Location_Mumbai Location_Pune Fuel_Type_Diesel Fuel_Type_LPG Fuel_Type_Petrol Transmission_Manual Owner_Type_Fourth & Above Owner_Type_Second Owner_Type_Third Car_category_Mid_Range Car_category_Luxury_Cars Car_category_Ultra_luxury
S.No.
0 2010 72000 58.16 5 998 26.60 0.56 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
1 2015 41000 126.20 5 1582 19.67 2.53 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0
2 2011 46000 88.70 5 1199 18.20 1.50 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
3 2012 87000 88.76 7 1248 20.77 1.79 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
4 2013 40670 140.80 5 1968 15.20 2.88 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0
In [169]:
# We will have to create the x and y datasets again
ind_vars = df1.drop(["price_log"], axis=1)
dep_var = df1[["price_log"]]

# Dummy encoding
ind_vars_num = encode_cat_vars(ind_vars)

# Splitting data into train and test
x_train1, x_test1, y_train1, y_test1 = train_test_split(
    ind_vars_num, dep_var, test_size=0.3, random_state=1
)

print("Number of rows in train data =", x_train1.shape[0])
print("Number of rows in train data =", x_test1.shape[0], "\n\n")

# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train1 = sm.add_constant(x_train1)
# Add constant to test data
x_test1 = sm.add_constant(x_test1)

# Fit linear model on new dataset and print the summary()
olsmodel1 = build_ols_model(x_train1)
print(olsmodel1.summary())
Number of rows in train data = 5072
Number of rows in train data = 2174 


                            OLS Regression Results                            
==============================================================================
Dep. Variable:              price_log   R-squared:                       0.808
Model:                            OLS   Adj. R-squared:                  0.807
Method:                 Least Squares   F-statistic:                     816.8
Date:                Fri, 20 Aug 2021   Prob (F-statistic):               0.00
Time:                        06:55:53   Log-Likelihood:                -1852.7
No. Observations:                5072   AIC:                             3759.
Df Residuals:                    5045   BIC:                             3936.
Df Model:                          26                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                      -173.6506      4.151    -41.837      0.000    -181.788    -165.514
Year                          0.0868      0.002     41.952      0.000       0.083       0.091
Kilometers_Driven         -1.178e-07   1.61e-07     -0.733      0.463   -4.33e-07    1.97e-07
Power                         0.0030      0.000     12.433      0.000       0.003       0.003
Seats                         0.0057      0.008      0.679      0.497      -0.011       0.022
Engine_num                    0.0002   2.34e-05      7.730      0.000       0.000       0.000
Mileage_num                  -0.0111      0.002     -5.220      0.000      -0.015      -0.007
Location_Bangalore            0.0652      0.033      1.999      0.046       0.001       0.129
Location_Chennai             -0.0180      0.031     -0.586      0.558      -0.078       0.042
Location_Coimbatore           0.0348      0.030      1.170      0.242      -0.024       0.093
Location_Delhi               -0.0365      0.030     -1.216      0.224      -0.095       0.022
Location_Hyderabad            0.0547      0.029      1.887      0.059      -0.002       0.111
Location_Jaipur              -0.1029      0.032     -3.260      0.001      -0.165      -0.041
Location_Kochi               -0.0409      0.030     -1.378      0.168      -0.099       0.017
Location_Kolkata             -0.1574      0.030     -5.206      0.000      -0.217      -0.098
Location_Mumbai              -0.0005      0.029     -0.017      0.986      -0.057       0.056
Location_Pune                -0.0604      0.030     -2.044      0.041      -0.118      -0.002
Fuel_Type_Diesel              0.2572      0.052      4.901      0.000       0.154       0.360
Fuel_Type_LPG                 0.1854      0.143      1.300      0.194      -0.094       0.465
Fuel_Type_Petrol              0.0802      0.053      1.501      0.133      -0.025       0.185
Transmission_Manual          -0.1355      0.015     -8.852      0.000      -0.166      -0.106
Owner_Type_Fourth & Above     0.0585      0.118      0.498      0.619      -0.172       0.289
Owner_Type_Second            -0.0587      0.015     -4.038      0.000      -0.087      -0.030
Owner_Type_Third             -0.1795      0.036     -4.957      0.000      -0.250      -0.109
Car_category_Mid_Range        0.7581      0.021     35.621      0.000       0.716       0.800
Car_category_Luxury_Cars      1.0504      0.031     34.280      0.000       0.990       1.110
Car_category_Ultra_luxury     1.2135      0.050     24.171      0.000       1.115       1.312
==============================================================================
Omnibus:                      254.478   Durbin-Watson:                   2.009
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              708.716
Skew:                          -0.242   Prob(JB):                    1.27e-154
Kurtosis:                       4.766   Cond. No.                     5.89e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.89e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

The R squared and adjusted r squared values have decreased which indicates that we have been able to capture less of the information of the previous model after reducing the number of predictor features.

As we try to decrease overfitting, the r squared of our train model is expected to decrease.

Check VIF¶

In [170]:
# Check VIF
print(checking_vif(x_train1))
                      feature       VIF
0                       const 714972.56
1                        Year      1.90
2           Kilometers_Driven      1.57
3                       Power      6.53
4                       Seats      1.89
5                  Engine_num      8.10
6                 Mileage_num      3.26
7          Location_Bangalore      2.41
8            Location_Chennai      3.00
9         Location_Coimbatore      3.46
10             Location_Delhi      3.09
11         Location_Hyderabad      3.73
12            Location_Jaipur      2.67
13             Location_Kochi      3.46
14           Location_Kolkata      3.09
15            Location_Mumbai      3.88
16              Location_Pune      3.50
17           Fuel_Type_Diesel     28.45
18              Fuel_Type_LPG      1.16
19           Fuel_Type_Petrol     29.39
20        Transmission_Manual      1.96
21  Owner_Type_Fourth & Above      1.02
22          Owner_Type_Second      1.18
23           Owner_Type_Third      1.12
24     Car_category_Mid_Range      1.45
25   Car_category_Luxury_Cars      1.38
26  Car_category_Ultra_luxury      1.28

We have a large number of high VIF scores that need treatment.

Check Model Performance¶

In [171]:
# Checking model performance
model_pref(olsmodel1, x_train1, x_test1)  # No Overfitting.
    Data  RMSE  MAE  MAPE
0  Train  0.35 0.26   inf
1   Test  0.33 0.26   inf
  • The RMSE has increased on both test and training sets meaning residual distance has increased.
  • The RMSE values not being closer indicates model fit as same as last model (+2points) and not overfitting.
  • MAE increased on training and test compared to the last model meaning our typical magnitude of the residuals has increased.

Model 3: df2, x_train2-3, Dropping Engine_num and others¶

Removing Multicollinearity¶

To remove multicollinearity

  1. Drop every column one by one that has VIF score greater than 5.
  2. Look at the adjusted R-squared of all these models.
  3. Drop the variable that makes least change in adjusted R-squared.
  4. Check the VIF scores again.
  5. Continue till you get all VIF scores under 5.

Our highest VIF offenders are in order:

Fuel_Type_Petrol 29.39

Fuel_Type_Diesel 28.45

Engine_num 8.10

In [173]:
# get a column of column names
high_vif_columns = [
    "Engine_num",
    "Fuel_Type_Diesel",
    "Fuel_Type_Petrol",
]

# input the last model's test and train df's into the function
treating_multicollinearity(high_vif_columns, x_train1, x_test1)
                col  Adj_rsq_after_dropping_col  Test RMSE
2  Fuel_Type_Petrol                        0.81       0.33
1  Fuel_Type_Diesel                        0.81       0.34
0        Engine_num                        0.80       0.34



Our Adjusted R-squared stays the same if we remove Engine. This is a clear indication that we have been able to create a very good model that is able to explain variance in price of used cars with up to 81% accurary.

However, RMSE has increased if we remove Fuel Type meaning the gap in our residuals has grown and we want this to be as low as possible to be accurate in how the model predicts the response.

"Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction."Source

Let's drop one just to see.

DropFuel_Type_Petrol

In [174]:
# we drop the one with the highest vif values and check the adjusted R-squared
x_train2 = x_train1.drop("Fuel_Type_Petrol", axis=1)
x_train2.head()
Out[174]:
const Year Kilometers_Driven Power Seats Engine_num Mileage_num Location_Bangalore Location_Chennai Location_Coimbatore Location_Delhi Location_Hyderabad Location_Jaipur Location_Kochi Location_Kolkata Location_Mumbai Location_Pune Fuel_Type_Diesel Fuel_Type_LPG Transmission_Manual Owner_Type_Fourth & Above Owner_Type_Second Owner_Type_Third Car_category_Mid_Range Car_category_Luxury_Cars Car_category_Ultra_luxury
S.No.
2080 1.00 2014 71000 126.32 5 1582 19.08 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0
62 1.00 2015 58000 186.00 5 1796 11.74 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
6978 1.00 2008 131000 105.00 5 1896 15.00 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0
3502 1.00 2015 54339 88.73 5 1396 22.54 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0
2156 1.00 2014 105044 100.60 7 2494 12.99 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0
In [175]:
# create VIF scores
vif_series2 = pd.Series(
    [variance_inflation_factor(x_train2.values, i) for i in range(x_train2.shape[1])],
    index=x_train2.columns,
)
print("VIF Scores: \n\n{}\n".format(vif_series2))
VIF Scores: 

const                       714843.18
Year                             1.90
Kilometers_Driven                1.57
Power                            6.52
Seats                            1.89
Engine_num                       7.99
Mileage_num                      3.08
Location_Bangalore               2.41
Location_Chennai                 3.00
Location_Coimbatore              3.46
Location_Delhi                   3.09
Location_Hyderabad               3.73
Location_Jaipur                  2.67
Location_Kochi                   3.46
Location_Kolkata                 3.09
Location_Mumbai                  3.88
Location_Pune                    3.50
Fuel_Type_Diesel                 2.11
Fuel_Type_LPG                    1.01
Transmission_Manual              1.96
Owner_Type_Fourth & Above        1.02
Owner_Type_Second                1.18
Owner_Type_Third                 1.12
Car_category_Mid_Range           1.45
Car_category_Luxury_Cars         1.38
Car_category_Ultra_luxury        1.28
dtype: float64

We have lowered the Fuel_Type_Diesel VIF and brought them closer to 5. Now we have Power and Engine_num to handle.

Create Model 3 and score

In [176]:
# we drop the one with the highest vif values and check the adjusted R-squared
high_vif_columns = [
    "Engine_num",
    # "Fuel_Type_Diesel",
    "Fuel_Type_Petrol",
    "Power",
]

# create a new dataframe and read the head
df2 = df1.drop(high_vif_columns, axis=1)
df2.head()

# We will have to create the x and y datasets again
ind_vars = df2.drop(["price_log"], axis=1)
dep_var = df2[["price_log"]]

# Dummy encoding
ind_vars_num = encode_cat_vars(ind_vars)

# Splitting data into train and test
x_train3, x_test3, y_train3, y_test3 = train_test_split(
    ind_vars_num, dep_var, test_size=0.3, random_state=1
)

# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train3 = sm.add_constant(x_train3)
# Add constant to test data
x_test3 = sm.add_constant(x_test3)
In [177]:
# create VIF scores
vif_series3 = pd.Series(
    [variance_inflation_factor(x_train3.values, i) for i in range(x_train3.shape[1])],
    index=x_train3.columns,
)
print("VIF Scores: \n\n{}\n".format(vif_series3))
VIF Scores: 

const                       714213.66
Year                             1.89
Kilometers_Driven                1.56
Seats                            1.58
Mileage_num                      2.01
Location_Bangalore               2.41
Location_Chennai                 3.00
Location_Coimbatore              3.46
Location_Delhi                   3.09
Location_Hyderabad               3.73
Location_Jaipur                  2.67
Location_Kochi                   3.45
Location_Kolkata                 3.08
Location_Mumbai                  3.88
Location_Pune                    3.49
Fuel_Type_Diesel                 1.56
Fuel_Type_LPG                    1.01
Transmission_Manual              1.68
Owner_Type_Fourth & Above        1.01
Owner_Type_Second                1.18
Owner_Type_Third                 1.12
Car_category_Mid_Range           1.38
Car_category_Luxury_Cars         1.26
Car_category_Ultra_luxury        1.12
dtype: float64

We passed the Multicolinearity Test and can create our new model.

In [178]:
print("Number of rows in train data =", x_train3.shape[0])
print("Number of rows in train data =", x_test3.shape[0], "\n\n")

# Fit linear model on new dataset and print the summary()
olsmodel2 = build_ols_model(x_train3)
print(olsmodel2.summary())
Number of rows in train data = 5072
Number of rows in train data = 2174 


                            OLS Regression Results                            
==============================================================================
Dep. Variable:              price_log   R-squared:                       0.781
Model:                            OLS   Adj. R-squared:                  0.780
Method:                 Least Squares   F-statistic:                     782.9
Date:                Fri, 20 Aug 2021   Prob (F-statistic):               0.00
Time:                        06:55:54   Log-Likelihood:                -2186.4
No. Observations:                5072   AIC:                             4421.
Df Residuals:                    5048   BIC:                             4577.
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                      -175.8018      4.429    -39.692      0.000    -184.485    -167.119
Year                          0.0885      0.002     40.108      0.000       0.084       0.093
Kilometers_Driven          8.531e-09   1.71e-07      0.050      0.960   -3.27e-07    3.44e-07
Seats                         0.0001      0.008      0.012      0.990      -0.016       0.016
Mileage_num                  -0.0410      0.002    -23.040      0.000      -0.044      -0.037
Location_Bangalore            0.0625      0.035      1.795      0.073      -0.006       0.131
Location_Chennai             -0.0400      0.033     -1.224      0.221      -0.104       0.024
Location_Coimbatore           0.0157      0.032      0.496      0.620      -0.046       0.078
Location_Delhi               -0.0258      0.032     -0.803      0.422      -0.089       0.037
Location_Hyderabad            0.0330      0.031      1.067      0.286      -0.028       0.094
Location_Jaipur              -0.1370      0.034     -4.073      0.000      -0.203      -0.071
Location_Kochi               -0.0344      0.032     -1.085      0.278      -0.097       0.028
Location_Kolkata             -0.1577      0.032     -4.887      0.000      -0.221      -0.094
Location_Mumbai               0.0063      0.031      0.204      0.838      -0.054       0.067
Location_Pune                -0.0892      0.032     -2.830      0.005      -0.151      -0.027
Fuel_Type_Diesel              0.3395      0.013     25.876      0.000       0.314       0.365
Fuel_Type_LPG                 0.0739      0.142      0.521      0.602      -0.204       0.352
Transmission_Manual          -0.2747      0.015    -18.161      0.000      -0.304      -0.245
Owner_Type_Fourth & Above    -0.0110      0.125     -0.088      0.930      -0.257       0.235
Owner_Type_Second            -0.0656      0.016     -4.226      0.000      -0.096      -0.035
Owner_Type_Third             -0.1759      0.039     -4.551      0.000      -0.252      -0.100
Car_category_Mid_Range        0.8769      0.022     39.618      0.000       0.834       0.920
Car_category_Luxury_Cars      1.2727      0.031     40.585      0.000       1.211       1.334
Car_category_Ultra_luxury     1.6833      0.050     33.613      0.000       1.585       1.781
==============================================================================
Omnibus:                      115.791   Durbin-Watson:                   2.014
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              202.525
Skew:                          -0.188   Prob(JB):                     1.05e-44
Kurtosis:                       3.904   Cond. No.                     5.88e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.88e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
In [179]:
# Checking model performance
model_pref(olsmodel2, x_train3, x_test3)  # No Overfitting.
    Data  RMSE  MAE  MAPE
0  Train  0.37 0.29   inf
1   Test  0.37 0.29   inf

Adj. R-squared: decreased 2% from the previous model as expected but is still a respectable 78% accuracy score.

RMSE:: Increased on training and test saying that our fit has decreased. The training and test sets have moved closer to each other indicating that the model is not overfitting the training data anymore. But RMSE is greater than MAE because it peanalises the outliers more.

MAE:: Increased on training and test saying that our fit has decreased. Meaning that our current model is able to predict used cars prices within mean error of 29% lakhs on test data.

MAPE: Has 0's in it so it returns an error.

Model 4: df3, x_train4, dropping all high p-values¶

Increase statistical significance¶

Remove those variable where p-value is above 0.05 and re-run the model, thus making the A-R^2 GO UP (reduced error).

In [180]:
# Drop 'Kilometers_Driven_log' from train and test
cols_to_drop = ["Kilometers_Driven","Location_Hyderabad","Fuel_Type_LPG","Location_Mumbai","Owner_Type_Fourth & Above","Location_Kochi","Location_Bangalore", "Location_Chennai", "Location_Coimbatore", "Location_Delhi", "Mileage_num"] #"Kilometers_Driven_log"

# we drop the one with the highest vif values and check the adjusted R-squared
x_train4 = x_train3.drop(cols_to_drop, axis=1)


# lock the new data frame
#x_train3 = x_train2.loc[:, ~x_train2.columns.str.startswith(col_to_drop)]
#x_test3 = x_test2.loc[:, ~x_test2.columns.str.startswith(col_to_drop)]
In [181]:
x_train4.head()
Out[181]:
const Year Seats Location_Jaipur Location_Kolkata Location_Pune Fuel_Type_Diesel Transmission_Manual Owner_Type_Second Owner_Type_Third Car_category_Mid_Range Car_category_Luxury_Cars Car_category_Ultra_luxury
S.No.
2080 1.00 2014 5 0 0 0 1 0 1 0 0 0 0
62 1.00 2015 5 0 0 0 0 0 0 0 1 0 0
6978 1.00 2008 5 0 0 0 1 1 0 0 0 0 0
3502 1.00 2015 5 0 0 1 1 1 1 0 0 0 0
2156 1.00 2014 7 1 0 0 1 1 0 0 0 0 0
In [182]:
# create VIF scores
vif_series3 = pd.Series(
    [variance_inflation_factor(x_train4.values, i) for i in range(x_train4.shape[1])],
    index=x_train4.columns,
)
print("VIF Scores: \n\n{}\n".format(vif_series3))
VIF Scores: 

const                       478325.68
Year                             1.26
Seats                            1.12
Location_Jaipur                  1.04
Location_Kolkata                 1.05
Location_Pune                    1.06
Fuel_Type_Diesel                 1.20
Transmission_Manual              1.45
Owner_Type_Second                1.15
Owner_Type_Third                 1.10
Car_category_Mid_Range           1.31
Car_category_Luxury_Cars         1.21
Car_category_Ultra_luxury        1.08
dtype: float64

In [183]:
print("Number of rows in train data =", x_train4.shape[0])
# print("Number of rows in train data =", x_test4.shape[0], "\n\n")

# Fit linear model on new dataset and print the summary()
olsmodel4 = build_ols_model(x_train4)
print(olsmodel4.summary())
Number of rows in train data = 5072
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              price_log   R-squared:                       0.757
Model:                            OLS   Adj. R-squared:                  0.756
Method:                 Least Squares   F-statistic:                     1310.
Date:                Fri, 20 Aug 2021   Prob (F-statistic):               0.00
Time:                        06:55:54   Log-Likelihood:                -2455.8
No. Observations:                5072   AIC:                             4938.
Df Residuals:                    5059   BIC:                             5022.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                      -139.6884      3.818    -36.584      0.000    -147.174    -132.203
Year                          0.0700      0.002     36.948      0.000       0.066       0.074
Seats                         0.1005      0.007     13.816      0.000       0.086       0.115
Location_Jaipur              -0.1543      0.022     -6.959      0.000      -0.198      -0.111
Location_Kolkata             -0.1679      0.020     -8.453      0.000      -0.207      -0.129
Location_Pune                -0.1017      0.018     -5.569      0.000      -0.138      -0.066
Fuel_Type_Diesel              0.2352      0.012     19.383      0.000       0.211       0.259
Transmission_Manual          -0.4019      0.015    -27.147      0.000      -0.431      -0.373
Owner_Type_Second            -0.0538      0.016     -3.334      0.001      -0.085      -0.022
Owner_Type_Third             -0.1939      0.040     -4.812      0.000      -0.273      -0.115
Car_category_Mid_Range        0.9782      0.023     42.935      0.000       0.934       1.023
Car_category_Luxury_Cars      1.3994      0.032     43.309      0.000       1.336       1.463
Car_category_Ultra_luxury     1.8896      0.052     36.527      0.000       1.788       1.991
==============================================================================
Omnibus:                       44.173   Durbin-Watson:                   2.011
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               69.903
Skew:                          -0.027   Prob(JB):                     6.62e-16
Kurtosis:                       3.573   Cond. No.                     1.39e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.39e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

Observations

  • We have removed multicolinearity.

  • We can see adj. R-squared for olsmod1 is 0.807 and olsmod4 is 0.799, a reduction after removing Power vs Engine and others and thus olsmod4 will be what we move foward with.

  • Now the above model has insignificant variables.

Now no feature has p-value greater than 0.05, so we'll consider the features in X_train4 as the final ones and olsres4 as final model.

Mean of the Residuals should be 0¶

Shows that the line of best fit is accurately drawn.

In [184]:
# get the residual factors and then the mean
residuals = olsmodel4.resid
np.mean(residuals)
Out[184]:
-1.294271065190537e-11
  • Mean of redisuals is very close to 0. The second assumption is also satisfied.

No Heteroscedasticity¶

  • Test - goldfeldquandt test

  • Homoscedacity: If the variance of the residuals are symmetrically distributed across the regression line, then the data is said to homoscedastic.

  • Heteroscedacity: If the variance is unequal for the residuals across the regression line, then the data is said to be heteroscedastic. In this case the residuals can form an arrow shape or any other non symmetrical shape.

For goldfeldquandt test, the null and alternate hypotheses are as follows:

  • Null hypothesis : Residuals are homoscedastic
  • Alternate hypothesis : Residuals have heteroscedasticity

alpha = 0.05

In [185]:
# create goldfeldquandt test
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(residuals, x_train3)
lzip(name, test)
Out[185]:
[('F statistic', 1.0562479984087707), ('p-value', 0.08516681687499524)]

Since p-value > 0.05 we cannot reject the Null Hypothesis that the residuals are homoscedastic.

Assumptions 3 is also satisfied by our olsmodel3.

Since p-value > 0.05, we can say that the residuals are homoscedastic. This assumption is therefore met in the data.

Linearity of Variables¶

Why the test?

  • Linearity describes a straight-line relationship between two variables, predictor variables must have a linear relation with the dependent variable.

How to check linearity?

  • Make a plot of fitted values vs residuals, if they don't follow any pattern, they we say the model is linear, otherwise model is showing signs of non-linearity.

How to fix if this assumption is not followed?

  • We can try to transform the variables and make the relationships linear.
In [186]:
# get the residual factors and then the mean
residual = olsmodel4.resid
fitted = olsmodel4.fittedvalues  # predicted values
In [187]:
# create a residual plot
sns.set_style("whitegrid")
sns.residplot(fitted, residual, color="purple", lowess=True)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residual vs Fitted plot")
plt.show()
  • Scatter plot shows the distribution of residuals (errors) vs fitted values (predicted values).

  • If there exist any pattern in this plot, we consider it as signs of non-linearity in the data and a pattern means that the model doesn't capture non-linear effects.

  • We do see a negative linear pattern mixed with a uniform pattern so we will need to fix this but as I do not know the best way to do this I will continue.

Normality of Error Terms¶

What is the test?

  • Error terms/Residuals should be normally distributed

  • If the error terms are non- normally distributed, confidence intervals may become too wide or narrow. Once confidence interval becomes unstable, it leads to difficulty in estimating coefficients based on minimization of least squares.

What do non-normality indicate?

  • It suggests that there are a few unusual data points which must be studied closely to make a better model.

How to Check the Normality?

  • It can be checked via QQ Plot, Residuals following normal distribution will make a straight line plot otherwise not.

  • Other test to check for normality : Shapiro-Wilk test.

What is the residuals are not-normal?

  • We can apply transformations like log, exponential, arcsinh, etc. as per our data.
In [188]:
# create a histogram of residuals
sns.distplot(residual)
plt.title("Normality of residuals")
plt.show()

QQPlot¶

The QQ plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.

In [189]:
# create a probability plot
stats.probplot(residual, dist="norm", plot=pylab)
plt.show()

The residuals have a close to normal distribution. Assumption 5 is also satisfied. We should further investigate these values in the tails where we have made huge residual errors.

Now that we have seen that olsmodel3 follows all the linear regression assumptions. Let us use that model to draw inferences.

Shapiro Test¶

In [190]:
# get Shapiro test results
stats.shapiro(residual)
Out[190]:
ShapiroResult(statistic=0.996060311794281, pvalue=2.0631184260988533e-10)
Observations:¶
  • The residuals are normal as per visual test, but as per QQ plot they are approximately normal.
  • The issue with shapiro test is when dataset is big, even for small deviations, it shows data as not normal.
  • The Shapiro test outputs a high negative number which is below a p=0.05 value. No departure from normality is found.

Now we have checked all the assumptions and they are satisfied, so we can move towards the prediction part.

Best Model Predicting ¶

Model Prediction¶

Let us now make predictions on the test set (X_test) with the model, and compare the actual output values with the predicted values.

In [191]:
# create a new dataframe and read the head
df3 = df2.drop(labels=high_vif_columns and cols_to_drop, axis=1)
df3.head()

# We will have to create the x and y datasets again
ind_vars = df3.drop(["price_log"], axis=1)
dep_var = df3[["price_log"]]

# Dummy encoding
ind_vars_num = encode_cat_vars(ind_vars)

# Splitting data into train and test
x_train5, x_test5, y_train5, y_test5 = train_test_split(
    ind_vars_num, dep_var, test_size=0.3, random_state=1
)

# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train5 = sm.add_constant(x_train5)
# Add constant to test data
x_test5 = sm.add_constant(x_test5)
In [192]:
x_test5.shape
Out[192]:
(2174, 13)
In [193]:
x_test5.head()
Out[193]:
const Year Seats Location_Jaipur Location_Kolkata Location_Pune Fuel_Type_Diesel Transmission_Manual Owner_Type_Second Owner_Type_Third Car_category_Mid_Range Car_category_Luxury_Cars Car_category_Ultra_luxury
S.No.
5043 1.00 2014 5 0 0 0 0 1 0 0 0 0 0
5670 1.00 2014 5 0 1 0 1 1 0 0 0 0 0
2782 1.00 2014 5 1 0 0 1 1 0 0 0 0 0
3659 1.00 2015 5 0 0 0 0 0 0 0 0 0 0
2869 1.00 2014 5 0 0 0 1 1 0 0 0 0 0
In [194]:
y_test5.shape
Out[194]:
(2174, 1)
In [195]:
# create linear model on training data
reg = LinearRegression().fit(x_train5, y_train5)

reg.predict(x_test5)  # Predict using the linear model. # predictions on the test set

reg.score(
    x_test5, y_test5
)  # return R2 coefficient of determination  of the prediction. (0-1)
Out[195]:
0.7601503856141686
  • We can observe here that our model has returned pretty good prediction results (~76% accuracy), and the actual and predicted values are comparable.
In [196]:
# get df columns
x_train5.columns
Out[196]:
Index(['const', 'Year', 'Seats', 'Location_Jaipur', 'Location_Kolkata',
       'Location_Pune', 'Fuel_Type_Diesel', 'Transmission_Manual',
       'Owner_Type_Second', 'Owner_Type_Third', 'Car_category_Mid_Range',
       'Car_category_Luxury_Cars', 'Car_category_Ultra_luxury'],
      dtype='object')
In [197]:
# Get the range of the y-variable
cols = [np.exp(y_train5), np.exp(y_test5)]
for i in cols:
    display(i.max() - i.min())
price_log   96.63
dtype: float64
price_log   119.55
dtype: float64
In [198]:
# Selecting columns from test data that we used to create our final model
X_test_final = x_test5[x_train5.columns]
In [199]:
# look at the head
X_test_final.head()
Out[199]:
const Year Seats Location_Jaipur Location_Kolkata Location_Pune Fuel_Type_Diesel Transmission_Manual Owner_Type_Second Owner_Type_Third Car_category_Mid_Range Car_category_Luxury_Cars Car_category_Ultra_luxury
S.No.
5043 1.00 2014 5 0 0 0 0 1 0 0 0 0 0
5670 1.00 2014 5 0 1 0 1 1 0 0 0 0 0
2782 1.00 2014 5 1 0 0 1 1 0 0 0 0 0
3659 1.00 2015 5 0 0 0 0 0 0 0 0 0 0
2869 1.00 2014 5 0 0 0 1 1 0 0 0 0 0

Observations

  • Dependent variable is in a log-scale so we will have to interpret accordingly.
  • The low price is 96Lakhs and the high price is 119.55Lakhs on the test set.
In [206]:
# Checking model performance on test set (seen 70% data)
print("Test Performance\n")
model_pref(olsmodel4, x_train5, x_test5)
Test Performance

    Data  RMSE  MAE  MAPE
0  Train  0.39 0.31   inf
1   Test  0.39 0.30   inf

RMSE:: Increased on training and test from model3 saying that our fit has decreased.

MAE:: Decreased from model3 on training and test saying that our fit has increased. Meaning that our current model is able to predict used cars prices within mean error of .30 lakhs on test data.

MAPE: Has 0's in it so it returns an error.

Inferences¶

In [207]:
# let us print the model summary
olsmodel4.summary()
Out[207]:
OLS Regression Results
Dep. Variable: price_log R-squared: 0.757
Model: OLS Adj. R-squared: 0.756
Method: Least Squares F-statistic: 1310.
Date: Fri, 20 Aug 2021 Prob (F-statistic): 0.00
Time: 07:08:54 Log-Likelihood: -2455.8
No. Observations: 5072 AIC: 4938.
Df Residuals: 5059 BIC: 5022.
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -139.6884 3.818 -36.584 0.000 -147.174 -132.203
Year 0.0700 0.002 36.948 0.000 0.066 0.074
Seats 0.1005 0.007 13.816 0.000 0.086 0.115
Location_Jaipur -0.1543 0.022 -6.959 0.000 -0.198 -0.111
Location_Kolkata -0.1679 0.020 -8.453 0.000 -0.207 -0.129
Location_Pune -0.1017 0.018 -5.569 0.000 -0.138 -0.066
Fuel_Type_Diesel 0.2352 0.012 19.383 0.000 0.211 0.259
Transmission_Manual -0.4019 0.015 -27.147 0.000 -0.431 -0.373
Owner_Type_Second -0.0538 0.016 -3.334 0.001 -0.085 -0.022
Owner_Type_Third -0.1939 0.040 -4.812 0.000 -0.273 -0.115
Car_category_Mid_Range 0.9782 0.023 42.935 0.000 0.934 1.023
Car_category_Luxury_Cars 1.3994 0.032 43.309 0.000 1.336 1.463
Car_category_Ultra_luxury 1.8896 0.052 36.527 0.000 1.788 1.991
Omnibus: 44.173 Durbin-Watson: 2.011
Prob(Omnibus): 0.000 Jarque-Bera (JB): 69.903
Skew: -0.027 Prob(JB): 6.62e-16
Kurtosis: 3.573 Cond. No. 1.39e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.39e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

Insights and Recommendations¶

olsres4 is our final model which follows all the assumptions, and can be used for interpretations.

The general linear regression formula is: Y = B0 + B1X1 + B2X2 + e.

Our linear regression formula would be: Price=8.37789660802926E-74(consant) + 3.10588904098151(Car_category_Luxury_Cars)+ 2.25128237881789(Car_category_Mid-Range)+ 1.1304321576378(Fuel_Type_Diesel)+ 1.08784644134162(Year)+ 1.08231254791372(Fuel_Type_LPG)+ 1.04759807906853(Location_Hyderabad)+ 1.00611864288793(Location_Mumbai)+ 1.00040008001067(Engine_num)+ 0.982062821165706(Seats)+ 0.948759440367809(Owner_Type_Second)+ 0.937067463377403(Location_Pune)+ 0.898345985825023(Location_Jaipur)+ 0.856329540247798(Location_Kolkata)+ 0.838282603224233(Owner_Type_Third)+ 0.812207036711939(Transmission_Manual)+ 0.99999975690003(Kilometers_Driven)

  1. With our linear regression model we have been able to capture ~76% information in our data. $R^2$ is 0.79, i.e., the model variance of its errors is 79% less than the variance of the dependent variable. So overall, the model is satisfactory.

It is important to note here that correlation is not equal to causation. That is to say that increase in Mileage does not lead to a drop in prices. It can be understood in such a way that the cars with high mileage do not have a high power engine and therefore have low prices.

  1. The model indicates that the most significant predictors of price of used cars are -

     - `Car_category_Ultra_luxury`
     - `Car_category_Luxury_Cars`
     - `Car_category_Mid-Range`
     - `Fuel_Type_Diesel`
     - `Year`
    
     The following features have a positive influence on Price:
     - `Year`
     - `Power`
     - `Fuel_Type_Diesel`
    
     The following features have a negative influence on Price:
     - `Transmission_Manual`
     -`Seats`
     -`Owner_Type_Third`
    
     The features of greatest impact are :
     - `Transmission_Manual` - in the negative direction.           
     - `Car_category_Ultra_luxury` - in the positive direction.

It is important to note here that the predicted values are log(x) and therefore coefficients have to be converted accordingly to understand their influence in price.

"Only independent/predictor variable(s) is log-transformed. Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units. Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002. For x percent increase, multiply the coefficient by log(1.x). Example: For every 10% increase in the independent variable, our dependent variable increases by about 0.198 * log(1.10) = 0.02." Source

  1. Newer cars sell for higher prices. 1 unit increase in the year of manufacture leads to [ exp(0.1145) = 1.12 Lakh ] increase in the price of the vehicle, when everything else is constant.

  2. As the number of seats increases, the price of the car increases - exp(0.0145) = 1.01 Lakhs

  3. Kilometers Driven have a negative relationship with the price which is intuitive. A car that has been driven more will have more wear and tear and hence sell at a lower price, everything else being 0.

  4. The categorical variables are a little hard to interpret. But it can be seen that all the car_category variables in the dataset have a positive relationship with the Price and the magnitude of this positive relationship increases as the brand category moves to the luxury brands. It will not be incorrect to interpret that the dropped car_category variable for budget friendly cars would have a negative relationship with the price because the other 3 are increasingly positive.

Recommendations¶

  1. Work on getting a dataset with a significant amount of New Car prices to come up with a new model. Develop a strategy to harvest cleaner and clearer datasets on an on-going basis.

  2. Flag any Electric vehicles as high-cost and Hindustan brand as low-cost. Other Fuel_Types have similar effects to eachother and will tend to be lower cost. The other Makes/Brands will be higher costing but all about the same as they have the same effect on Price.

  3. With Automatic and Manuals having similar effects, you can safely disregard distinguishing them when giving recommendations.

  4. When looking to predict used car prices, use the price category data to navigate customers towards the prices they can expect to pay.

  5. Some southern markets tend to have higher prices. It might be a good strategy to plan growth in southern cities using this information. Markets like Kolkata(coeff = -0.15log) are very risky and we need to be careful about investments in this area.

Project Improvement¶

Notes¶

  1. I could have tried log transformations to make the independent more normal instead of the dependent.

  2. I could have removed categorical or numeric types/columns where only one record existed. This has a similar effect as outliers in that one variable can skew the inferences in a direction.

  3. I could have done a PCA feature analysis to reduce the number of input features which would possibily have changed the R^2 value and coefficients.

Add-On-Analysis¶

Analysising predictions where we were way off the mark.

In [209]:
# Extracting the rows from original data frame df where indexes are same as the training data of best model
original_df = df[df.index.isin(x_train5.index.values)].copy()

# Extracting predicted values from the final model
residuals = olsmodel4.resid
fitted_values = olsmodel4.fittedvalues

# Add new columns for predicted values
original_df["Predicted price_log "] = fitted_values
original_df["Predicted Price"] = fitted_values.apply(math.exp)
original_df["residuals"] = residuals
original_df["Abs_residuals"] = residuals.apply(math.exp)
original_df["Difference in Lakhs"] = np.abs(
    original_df["price_log"] - original_df["Predicted Price"]
)

# Let us look at the top 20 predictions where our model made highest extimation errors (on train data)
original_df.sort_values(by=["Difference in Lakhs"], ascending=False).head(100)
Out[209]:
Year Kilometers_Driven Power Seats Engine_num Mileage_num price_log Location_Bangalore Location_Chennai Location_Coimbatore Location_Delhi Location_Hyderabad Location_Jaipur Location_Kochi Location_Kolkata Location_Mumbai Location_Pune Fuel_Type_Diesel Fuel_Type_LPG Fuel_Type_Petrol Transmission_Manual Owner_Type_Fourth & Above Owner_Type_Second Owner_Type_Third Brand_Audi Brand_Bentley Brand_Bmw Brand_Chevrolet Brand_Datsun Brand_Fiat Brand_Force Brand_Ford Brand_Hindustan Brand_Honda Brand_Hyundai Brand_Isuzu Brand_Jaguar Brand_Jeep Brand_Lamborghini Brand_Land Brand_Mahindra Brand_Maruti Brand_Mercedes-Benz Brand_Mini Brand_Mitsubishi Brand_Nissan Brand_Opelcorsa Brand_Porsche Brand_Renault Brand_Skoda Brand_Smart Brand_Tata Brand_Toyota Brand_Volkswagen Brand_Volvo Car_category_Mid_Range Car_category_Luxury_Cars Car_category_Ultra_luxury Predicted price_log Predicted Price residuals Abs_residuals Difference in Lakhs
S.No.
5038 2019 24013 148.31 7 1999 12.81 4.03 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.55 95.00 -0.53 0.59 90.97
5067 2019 17201 187.70 7 2179 12.51 4.08 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.55 95.00 -0.48 0.62 90.92
1583 2019 6773 245.00 7 2967 14.75 4.29 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.55 95.00 -0.26 0.77 90.71
282 2018 11287 241.40 7 2967 12.07 4.25 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.48 88.57 -0.23 0.79 84.32
327 2017 97430 245.00 7 2967 14.75 4.14 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.41 82.58 -0.28 0.76 78.44
3705 2019 29311 170.00 5 2143 17.90 3.93 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.35 77.71 -0.42 0.66 73.77
3043 2019 25899 147.50 5 2179 12.63 4.02 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.35 77.71 -0.33 0.72 73.69
1505 2019 26013 255.00 5 2993 12.65 4.58 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.35 77.71 0.22 1.25 73.13
4345 2016 45180 241.40 7 2967 12.07 3.92 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.34 77.00 -0.43 0.65 73.08
3051 2016 16000 254.80 7 2987 11.30 4.03 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.34 77.00 -0.32 0.73 72.97
918 2016 76000 258.00 7 2993 15.97 4.06 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.34 77.00 -0.28 0.75 72.94
1336 2016 20002 245.00 7 2967 14.75 4.20 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.34 77.00 -0.14 0.87 72.79
489 2017 129000 258.00 7 2987 11.00 4.03 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.31 74.60 -0.29 0.75 70.57
5009 2017 59500 245.00 7 2967 14.75 4.22 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.31 74.60 -0.09 0.91 70.38
2537 2016 28733 241.40 7 2967 12.07 3.97 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.29 72.96 -0.32 0.73 68.99
866 2018 56652 258.00 5 2987 13.00 3.92 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.28 72.45 -0.37 0.69 68.53
4003 2018 3000 187.70 5 2179 16.36 3.96 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.28 72.45 -0.32 0.72 68.49
191 2018 36091 187.70 5 2179 12.70 4.02 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.28 72.45 -0.26 0.77 68.43
2509 2018 11000 194.00 5 1950 16.10 4.03 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.28 72.45 -0.25 0.78 68.42
2767 2018 30706 254.79 5 2987 11.57 4.08 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.28 72.45 -0.20 0.82 68.37
3268 2018 29277 254.79 5 2987 11.57 4.09 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.28 72.45 -0.19 0.82 68.36
2541 2018 21603 265.00 5 2987 13.00 4.25 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.28 72.45 -0.03 0.97 68.20
4938 2015 56238 335.20 7 4134 11.33 3.95 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.27 71.79 -0.33 0.72 67.84
4166 2015 45000 254.80 7 2987 11.30 4.03 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.27 71.79 -0.25 0.78 67.76
2593 2015 21000 241.40 7 2967 12.07 3.96 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.22 68.03 -0.26 0.77 64.07
355 2017 21000 177.00 5 1999 19.33 3.95 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.21 67.55 -0.26 0.77 63.60
404 2017 34990 187.70 5 2179 12.70 3.99 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.21 67.55 -0.23 0.80 63.56
2111 2017 25601 201.15 5 2143 17.90 4.04 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.21 67.55 -0.17 0.84 63.51
310 2017 29000 201.15 5 2143 17.90 4.04 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.21 67.55 -0.17 0.84 63.51
264 2017 29819 254.79 5 2987 11.57 4.12 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.21 67.55 -0.10 0.91 63.43
1830 2017 33552 254.79 5 2987 11.57 4.17 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.21 67.55 -0.04 0.96 63.38
1194 2017 15000 265.00 5 2987 13.00 4.24 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.21 67.55 0.03 1.03 63.31
4614 2017 15000 258.00 5 2993 15.97 4.25 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.21 67.55 0.04 1.04 63.30
1093 2014 50000 335.20 7 4134 11.33 3.93 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.20 66.93 -0.27 0.76 63.00
4360 2018 20004 313.00 4 2993 17.54 4.23 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.18 65.52 0.05 1.05 61.29
5740 2017 21000 258.00 5 2993 15.97 4.01 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.16 64.01 -0.15 0.86 60.00
1496 2016 17652 270.88 5 2993 12.90 4.07 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.14 62.98 -0.07 0.93 58.91
2547 2016 25000 308.43 5 2993 15.87 4.09 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.14 62.98 -0.05 0.95 58.88
5341 2016 65003 382.00 5 4134 13.60 4.22 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 4.14 62.98 0.07 1.08 58.76
5582 2019 28069 174.50 7 2755 12.90 3.43 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 4.06 58.19 -0.63 0.53 54.76
5191 2019 9000 158.00 7 2198 12.62 3.49 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 4.06 58.19 -0.57 0.57 54.70
2855 2019 21719 174.50 7 2755 12.90 3.50 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 4.06 58.19 -0.56 0.57 54.69
3196 2019 9000 197.00 7 3198 10.91 3.57 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 4.06 58.19 -0.49 0.61 54.62
3752 2015 38467 255.00 5 2993 12.65 4.26 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.07 58.72 0.19 1.20 54.46
2907 2015 26000 301.73 5 2993 14.47 4.32 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.07 58.72 0.24 1.28 54.40
5535 2015 97003 308.43 5 2993 15.87 4.44 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.07 58.72 0.37 1.45 54.28
1352 2012 66538 308.00 7 4367 10.20 4.25 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.06 58.18 0.18 1.20 53.93
1113 2018 20446 191.34 5 1950 12.60 3.94 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.05 57.27 -0.11 0.90 53.33
418 2018 22397 367.00 5 2996 11.50 4.26 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.05 57.27 0.21 1.24 53.00
1974 2018 28060 320.00 5 2979 12.05 4.54 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.05 57.27 0.49 1.64 52.73
2669 2018 51530 197.20 7 3198 10.91 3.41 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.99 54.25 -0.58 0.56 50.84
4387 2014 35000 258.00 5 2993 16.46 3.95 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4.00 54.75 -0.05 0.95 50.80
2145 2018 14000 158.00 7 2198 12.62 3.47 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.99 54.25 -0.53 0.59 50.79
1430 2018 30552 197.20 7 3198 10.91 3.48 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.99 54.25 -0.51 0.60 50.77
4906 2018 15000 197.00 7 3198 10.91 3.50 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.99 54.25 -0.50 0.61 50.76
5278 2018 12000 197.00 7 3198 10.91 3.51 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.99 54.25 -0.48 0.62 50.74
655 2017 10501 237.40 5 1999 9.43 4.32 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3.98 53.39 0.34 1.40 49.07
4185 2015 54996 300.00 4 2967 17.85 4.11 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 3.97 53.11 0.14 1.15 49.00
1984 2017 17465 320.00 5 2979 12.05 4.53 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3.98 53.39 0.55 1.74 48.86
1968 2018 37713 168.50 7 2982 12.55 3.45 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 3.94 51.41 -0.49 0.62 47.96
4451 2018 21290 395.00 4 4951 13.00 4.04 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3.95 51.79 0.09 1.10 47.75
5545 2014 47000 255.00 5 2993 12.65 4.17 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3.95 51.88 0.22 1.25 47.71
5441 2017 54650 197.00 7 3198 10.91 3.41 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.92 50.58 -0.51 0.60 47.17
1724 2017 38000 197.00 7 3198 10.91 3.42 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.92 50.58 -0.51 0.60 47.17
5323 2017 17000 197.00 7 3198 10.91 3.43 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.92 50.58 -0.49 0.61 47.15
703 2017 50484 197.20 7 3198 10.91 3.44 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.92 50.58 -0.49 0.62 47.15
1568 2017 40000 148.31 7 1999 12.81 3.73 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.92 50.58 -0.20 0.82 46.86
2178 2017 35000 255.00 7 2993 18.00 3.73 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.92 50.58 -0.20 0.82 46.86
2706 2014 60033 250.00 4 2967 18.18 3.94 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 3.90 49.51 0.04 1.04 45.57
5158 2019 20483 136.00 5 2143 17.90 3.41 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.86 47.60 -0.45 0.64 44.19
5416 2013 75995 335.30 5 4367 11.49 4.19 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3.88 48.37 0.31 1.36 44.18
3567 2019 13414 136.00 5 2143 17.90 3.43 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.86 47.60 -0.44 0.65 44.17
3161 2019 14857 113.98 5 1496 20.70 3.45 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.86 47.60 -0.41 0.66 44.15
4591 2019 38163 174.33 5 1968 17.11 3.49 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.86 47.60 -0.37 0.69 44.10
5875 2019 4000 194.00 5 1950 18.20 3.56 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.86 47.60 -0.31 0.74 44.04
5673 2019 15976 136.00 5 2143 17.90 3.60 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.86 47.60 -0.26 0.77 43.99
4423 2019 17320 170.00 5 2148 15.80 3.64 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.86 47.60 -0.23 0.80 43.96
2528 2016 59000 255.00 7 2993 18.00 3.60 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.85 47.16 -0.25 0.78 43.56
1397 2016 35659 258.00 7 2993 15.97 3.78 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.85 47.16 -0.07 0.93 43.38
3540 2016 62000 187.70 7 2179 12.51 3.80 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.85 47.16 -0.06 0.94 43.37
459 2016 51002 335.20 7 4134 11.33 3.89 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.85 47.16 0.04 1.04 43.27
5603 2013 36400 394.30 4 4806 7.50 4.28 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 3.83 46.16 0.44 1.56 41.89
3242 2012 63000 335.30 5 4367 11.49 4.17 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3.81 45.10 0.37 1.44 40.92
4233 2018 22060 147.51 5 1968 18.51 3.49 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 -0.30 0.74 40.89
5302 2016 31075 241.40 7 2967 12.07 3.91 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.80 44.69 0.11 1.11 40.78
1689 2018 32586 190.00 5 1995 22.48 3.62 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 -0.17 0.84 40.76
4795 2018 21936 190.00 5 1995 21.76 3.64 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 -0.16 0.85 40.74
3978 2018 27467 167.62 5 2143 19.27 3.66 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 -0.13 0.88 40.71
2429 2018 23152 167.62 5 2143 19.27 3.67 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 -0.12 0.88 40.71
224 2018 29161 167.62 5 2143 19.27 3.67 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 -0.12 0.89 40.70
2121 2018 24465 167.62 5 2143 19.27 3.68 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 -0.12 0.89 40.70
772 2018 36797 167.62 5 2143 19.27 3.68 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 -0.12 0.89 40.70
962 2018 8682 194.00 5 1950 18.20 3.68 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 -0.12 0.89 40.70
92 2018 48367 174.33 5 1968 17.68 3.68 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 -0.11 0.89 40.70
2659 2015 25100 300.00 4 2967 17.85 4.32 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 3.80 44.90 0.51 1.67 40.58
5477 2018 52827 187.70 5 2179 12.70 3.80 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 0.01 1.01 40.58
2924 2018 41661 187.70 5 2179 12.70 3.80 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 0.01 1.01 40.57
5927 2018 29091 241.40 5 2967 13.22 3.82 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 0.03 1.03 40.56
603 2018 10003 170.00 5 2143 17.90 3.88 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.79 44.38 0.09 1.09 40.50
5761 2015 55662 241.40 7 2967 12.07 3.76 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3.78 43.97 -0.03 0.97 40.21
  • An Ultra Luxury used car, whose used model sells at 56.26 Lakhs and was predicted to be sold at 95 Lakhs. It is not apparent after looking at numerical predictors, why our model predicted such high value here. This could be because all other Ultra Luxury vehicles in our data seem to have sold at higher prices.

  • There are no instances of false predictions for Mid_Range vehicles.

  • There are no instances where the model predicts lesser than the actual selling price. These could be a cause for concern. The model predicting higher than potential selling price is not good for business in the used car market where a lower-price point is the main attraction.

Let's take a look at the model's missed predictions for the Ultra Luxury used vehicles.

In [212]:
# scatter plot of predications
sns.scatterplot(
    original_df["Difference in Lakhs"],
    original_df["price_log"],
    hue=original_df["Car_category_Ultra_luxury"],
)
Out[212]:
<AxesSubplot:xlabel='Difference in Lakhs', ylabel='price_log'>

Observations

  • There is a clear grouping between the Mid-Range, Luxury, and Ultra Luxury vehicle categories demarcated by price difference.

  • Most outliers are the Ultra Luxury models. Our model predicts that resale value of Ultra Luxury cars is higher compared to Mid and Luxury cars. This is probably the cause of these outliers.

  • Our model seems to be very good at predicting Mid-range, and perhaps therefore Budget as well, vehicle prices.

----------------- END -----------------------------------

In [ ]: